Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.
Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).
“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.
The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variables should be considered as “failure” and “0” represents “No failure”.
# Imblearn libary is used to handle imbalanced data
# Jupyter notebook
!pip install imblearn --user
!pip install imbalanced-learn --user
# Anaconda prompt
#!pip install -U imbalanced-learn
#conda install -c conda-forge imbalanced-learn
# Restart the kernel after successful installation of the library
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: imblearn in /usr/local/lib/python3.7/dist-packages (0.0) Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.7/dist-packages (from imblearn) (0.8.1) Requirement already satisfied: scikit-learn>=0.24 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn->imblearn) (1.0.2) Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn->imblearn) (1.7.3) Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn->imblearn) (1.1.0) Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn->imblearn) (1.21.6) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.24->imbalanced-learn->imblearn) (3.1.0) Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.7/dist-packages (0.8.1) Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn) (1.1.0) Requirement already satisfied: scikit-learn>=0.24 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn) (1.0.2) Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn) (1.7.3) Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from imbalanced-learn) (1.21.6) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.24->imbalanced-learn) (3.1.0)
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer, KNNImputer
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 2 decimal points
pd.set_option("display.float_format", lambda x: "%.2f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
# to use standard scaler
from sklearn.preprocessing import StandardScaler
# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To supress warnings
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
train_data = pd.read_csv("/content/Train.csv.csv")
test_data = pd.read_csv("/content/Test.csv.csv")
train_data.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.46 | -4.68 | 3.10 | 0.51 | -0.22 | -2.03 | -2.91 | 0.05 | -1.52 | 3.76 | -5.71 | 0.74 | 0.98 | 1.42 | -3.38 | -3.05 | 0.31 | 2.91 | 2.27 | 4.39 | -2.39 | 0.65 | -1.19 | 3.13 | 0.67 | -2.51 | -0.04 | 0.73 | -3.98 | -1.07 | 1.67 | 3.06 | -1.69 | 2.85 | 2.24 | 6.67 | 0.44 | -2.37 | 2.95 | -3.48 | 0 |
| 1 | 3.37 | 3.65 | 0.91 | -1.37 | 0.33 | 2.36 | 0.73 | -4.33 | 0.57 | -0.10 | 1.91 | -0.95 | -1.26 | -2.71 | 0.19 | -4.77 | -2.21 | 0.91 | 0.76 | -5.83 | -3.07 | 1.60 | -1.76 | 1.77 | -0.27 | 3.63 | 1.50 | -0.59 | 0.78 | -0.20 | 0.02 | -1.80 | 3.03 | -2.47 | 1.89 | -2.30 | -1.73 | 5.91 | -0.39 | 0.62 | 0 |
| 2 | -3.83 | -5.82 | 0.63 | -2.42 | -1.77 | 1.02 | -2.10 | -3.17 | -2.08 | 5.39 | -0.77 | 1.11 | 1.14 | 0.94 | -3.16 | -4.25 | -4.04 | 3.69 | 3.31 | 1.06 | -2.14 | 1.65 | -1.66 | 1.68 | -0.45 | -4.55 | 3.74 | 1.13 | -2.03 | 0.84 | -1.60 | -0.26 | 0.80 | 4.09 | 2.29 | 5.36 | 0.35 | 2.94 | 3.84 | -4.31 | 0 |
| 3 | 1.62 | 1.89 | 7.05 | -1.15 | 0.08 | -1.53 | 0.21 | -2.49 | 0.34 | 2.12 | -3.05 | 0.46 | 2.70 | -0.64 | -0.45 | -3.17 | -3.40 | -1.28 | 1.58 | -1.95 | -3.52 | -1.21 | -5.63 | -1.82 | 2.12 | 5.29 | 4.75 | -2.31 | -3.96 | -6.03 | 4.95 | -3.58 | -2.58 | 1.36 | 0.62 | 5.55 | -1.53 | 0.14 | 3.10 | -1.28 | 0 |
| 4 | -0.11 | 3.87 | -3.76 | -2.98 | 3.79 | 0.54 | 0.21 | 4.85 | -1.85 | -6.22 | 2.00 | 4.72 | 0.71 | -1.99 | -2.63 | 4.18 | 2.25 | 3.73 | -6.31 | -5.38 | -0.89 | 2.06 | 9.45 | 4.49 | -3.95 | 4.58 | -8.78 | -3.38 | 5.11 | 6.79 | 2.04 | 8.27 | 6.63 | -10.07 | 1.22 | -3.23 | 1.69 | -2.16 | -3.64 | 6.51 | 0 |
test_data.head()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.61 | -3.82 | 2.20 | 1.30 | -1.18 | -4.50 | -1.84 | 4.72 | 1.21 | -0.34 | -5.12 | 1.02 | 4.82 | 3.27 | -2.98 | 1.39 | 2.03 | -0.51 | -1.02 | 7.34 | -2.24 | 0.16 | 2.05 | -2.77 | 1.85 | -1.79 | -0.28 | -1.26 | -3.83 | -1.50 | 1.59 | 2.29 | -5.41 | 0.87 | 0.57 | 4.16 | 1.43 | -10.51 | 0.45 | -1.45 | 0 |
| 1 | 0.39 | -0.51 | 0.53 | -2.58 | -1.02 | 2.24 | -0.44 | -4.41 | -0.33 | 1.97 | 1.80 | 0.41 | 0.64 | -1.39 | -1.88 | -5.02 | -3.83 | 2.42 | 1.76 | -3.24 | -3.19 | 1.86 | -1.71 | 0.63 | -0.59 | 0.08 | 3.01 | -0.18 | 0.22 | 0.87 | -1.78 | -2.47 | 2.49 | 0.32 | 2.06 | 0.68 | -0.49 | 5.13 | 1.72 | -1.49 | 0 |
| 2 | -0.87 | -0.64 | 4.08 | -1.59 | 0.53 | -1.96 | -0.70 | 1.35 | -1.73 | 0.47 | -4.93 | 3.57 | -0.45 | -0.66 | -0.17 | -1.63 | 2.29 | 2.40 | 0.60 | 1.79 | -2.12 | 0.48 | -0.84 | 1.79 | 1.87 | 0.36 | -0.17 | -0.48 | -2.12 | -2.16 | 2.91 | -1.32 | -3.00 | 0.46 | 0.62 | 5.63 | 1.32 | -1.75 | 1.81 | 1.68 | 0 |
| 3 | 0.24 | 1.46 | 4.01 | 2.53 | 1.20 | -3.12 | -0.92 | 0.27 | 1.32 | 0.70 | -5.58 | -0.85 | 2.59 | 0.77 | -2.39 | -2.34 | 0.57 | -0.93 | 0.51 | 1.21 | -3.26 | 0.10 | -0.66 | 1.50 | 1.10 | 4.14 | -0.25 | -1.14 | -5.36 | -4.55 | 3.81 | 3.52 | -3.07 | -0.28 | 0.95 | 3.03 | -1.37 | -3.41 | 0.91 | -2.45 | 0 |
| 4 | 5.83 | 2.77 | -1.23 | 2.81 | -1.64 | -1.41 | 0.57 | 0.97 | 1.92 | -2.77 | -0.53 | 1.37 | -0.65 | -1.68 | -0.38 | -4.44 | 3.89 | -0.61 | 2.94 | 0.37 | -5.79 | 4.60 | 4.45 | 3.22 | 0.40 | 0.25 | -2.36 | 1.08 | -0.47 | 2.24 | -3.59 | 1.77 | -1.50 | -2.23 | 4.78 | -6.56 | -0.81 | -0.28 | -3.86 | -0.54 | 0 |
df = train_data.copy()
df.head(3)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.46 | -4.68 | 3.10 | 0.51 | -0.22 | -2.03 | -2.91 | 0.05 | -1.52 | 3.76 | -5.71 | 0.74 | 0.98 | 1.42 | -3.38 | -3.05 | 0.31 | 2.91 | 2.27 | 4.39 | -2.39 | 0.65 | -1.19 | 3.13 | 0.67 | -2.51 | -0.04 | 0.73 | -3.98 | -1.07 | 1.67 | 3.06 | -1.69 | 2.85 | 2.24 | 6.67 | 0.44 | -2.37 | 2.95 | -3.48 | 0 |
| 1 | 3.37 | 3.65 | 0.91 | -1.37 | 0.33 | 2.36 | 0.73 | -4.33 | 0.57 | -0.10 | 1.91 | -0.95 | -1.26 | -2.71 | 0.19 | -4.77 | -2.21 | 0.91 | 0.76 | -5.83 | -3.07 | 1.60 | -1.76 | 1.77 | -0.27 | 3.63 | 1.50 | -0.59 | 0.78 | -0.20 | 0.02 | -1.80 | 3.03 | -2.47 | 1.89 | -2.30 | -1.73 | 5.91 | -0.39 | 0.62 | 0 |
| 2 | -3.83 | -5.82 | 0.63 | -2.42 | -1.77 | 1.02 | -2.10 | -3.17 | -2.08 | 5.39 | -0.77 | 1.11 | 1.14 | 0.94 | -3.16 | -4.25 | -4.04 | 3.69 | 3.31 | 1.06 | -2.14 | 1.65 | -1.66 | 1.68 | -0.45 | -4.55 | 3.74 | 1.13 | -2.03 | 0.84 | -1.60 | -0.26 | 0.80 | 4.09 | 2.29 | 5.36 | 0.35 | 2.94 | 3.84 | -4.31 | 0 |
df.shape
(20000, 41)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 20000 entries, 0 to 19999 Data columns (total 41 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 V1 19982 non-null float64 1 V2 19982 non-null float64 2 V3 20000 non-null float64 3 V4 20000 non-null float64 4 V5 20000 non-null float64 5 V6 20000 non-null float64 6 V7 20000 non-null float64 7 V8 20000 non-null float64 8 V9 20000 non-null float64 9 V10 20000 non-null float64 10 V11 20000 non-null float64 11 V12 20000 non-null float64 12 V13 20000 non-null float64 13 V14 20000 non-null float64 14 V15 20000 non-null float64 15 V16 20000 non-null float64 16 V17 20000 non-null float64 17 V18 20000 non-null float64 18 V19 20000 non-null float64 19 V20 20000 non-null float64 20 V21 20000 non-null float64 21 V22 20000 non-null float64 22 V23 20000 non-null float64 23 V24 20000 non-null float64 24 V25 20000 non-null float64 25 V26 20000 non-null float64 26 V27 20000 non-null float64 27 V28 20000 non-null float64 28 V29 20000 non-null float64 29 V30 20000 non-null float64 30 V31 20000 non-null float64 31 V32 20000 non-null float64 32 V33 20000 non-null float64 33 V34 20000 non-null float64 34 V35 20000 non-null float64 35 V36 20000 non-null float64 36 V37 20000 non-null float64 37 V38 20000 non-null float64 38 V39 20000 non-null float64 39 V40 20000 non-null float64 40 Target 20000 non-null int64 dtypes: float64(40), int64(1) memory usage: 6.3 MB
df.duplicated().sum()
0
df.isnull().sum()
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
df[df['V1'].isnull()==True]
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | NaN | -3.96 | 2.79 | -4.71 | -3.01 | -1.54 | -0.88 | 1.48 | 0.57 | -1.10 | -1.85 | 4.54 | 4.49 | 0.71 | -2.14 | -2.03 | 0.14 | 2.79 | -1.17 | 4.87 | -3.92 | 1.49 | -0.17 | -6.47 | 3.01 | -3.13 | 3.96 | -1.90 | -0.64 | -0.54 | -1.88 | -8.33 | -5.14 | 1.12 | -0.31 | 5.32 | 3.75 | -5.63 | 2.37 | 2.20 | 0 |
| 5941 | NaN | 1.01 | 1.23 | 5.40 | 0.06 | -2.71 | -2.03 | 0.53 | 3.01 | -2.36 | -5.71 | -1.62 | -0.05 | -0.51 | -3.03 | -5.00 | 6.43 | 0.77 | 1.24 | 5.86 | -3.85 | 1.71 | 1.02 | 2.31 | 1.16 | 0.39 | -4.91 | 1.45 | -2.54 | -0.52 | -2.75 | 1.87 | -3.12 | -0.55 | 1.71 | -2.26 | 0.41 | -3.43 | -1.30 | -1.77 | 0 |
| 6317 | NaN | -5.21 | 2.00 | -3.71 | -1.04 | -1.59 | -2.65 | 0.85 | -1.31 | 2.41 | -2.70 | 3.52 | 6.08 | 1.89 | -6.30 | -2.35 | -3.71 | 4.06 | -0.37 | 1.62 | -5.27 | 2.43 | 2.35 | 0.06 | -0.47 | -1.31 | 1.87 | -2.45 | -2.91 | 1.17 | 1.49 | 3.07 | -0.07 | -0.28 | 3.20 | 7.02 | 1.30 | -4.58 | 2.96 | -2.36 | 0 |
| 6464 | NaN | 2.15 | 5.00 | 4.19 | 1.43 | -6.44 | -0.93 | 3.79 | -0.68 | -0.74 | -8.19 | 6.68 | 4.11 | -0.65 | -4.76 | -1.71 | 4.04 | -0.46 | 4.03 | 3.83 | -5.31 | 0.93 | 2.93 | 4.46 | -0.35 | 4.86 | -5.04 | -0.77 | -5.67 | -2.64 | 1.85 | 5.23 | -5.11 | 1.75 | 2.59 | 3.99 | 0.61 | -4.27 | 1.86 | -3.60 | 0 |
| 7073 | NaN | 2.53 | 2.76 | -1.67 | -1.94 | -0.03 | 0.91 | -3.20 | 2.95 | -0.41 | 0.01 | -0.48 | 2.91 | -0.94 | -0.65 | -6.15 | -2.60 | -0.67 | 0.77 | -2.70 | -6.40 | 2.86 | -1.41 | -2.86 | 2.36 | 3.17 | 5.59 | -1.77 | -2.73 | -3.30 | -0.20 | -4.89 | -2.61 | -1.50 | 2.04 | -0.83 | -1.37 | 0.57 | -0.13 | -0.32 | 0 |
| 8431 | NaN | -1.40 | -2.01 | -1.75 | 0.93 | -1.29 | -0.27 | 4.46 | -2.78 | -1.21 | -2.05 | 5.28 | -0.87 | 0.07 | -0.67 | 1.87 | 3.44 | 3.30 | -0.93 | 0.94 | -0.56 | 2.55 | 6.47 | 4.47 | -0.81 | -2.22 | -3.84 | 0.17 | 0.23 | 2.96 | 0.42 | 4.56 | -0.42 | -2.04 | 1.11 | 1.52 | 2.11 | -2.25 | -0.94 | 2.54 | 0 |
| 8439 | NaN | -3.84 | 0.20 | 4.15 | 1.15 | -0.99 | -4.73 | 0.56 | -0.93 | 0.46 | -4.89 | -1.25 | -1.65 | -0.23 | -5.41 | -2.99 | 4.83 | 4.64 | 1.30 | 6.40 | -1.09 | 0.13 | 0.41 | 6.21 | -1.94 | -3.00 | -8.53 | 2.12 | 0.82 | 4.87 | -2.01 | 6.82 | 3.45 | 0.24 | 3.22 | 1.20 | 1.27 | -1.92 | 0.58 | -2.84 | 0 |
| 11156 | NaN | -0.67 | 3.72 | 4.93 | 1.67 | -4.36 | -2.82 | 0.37 | -0.71 | 2.18 | -8.81 | 2.56 | 1.96 | 0.00 | -5.94 | -4.68 | 3.29 | 1.98 | 4.43 | 4.71 | -4.12 | 1.05 | 0.86 | 6.75 | -0.81 | 1.88 | -4.79 | 1.25 | -6.28 | -2.25 | 0.46 | 6.66 | -2.90 | 3.07 | 2.49 | 4.81 | 0.07 | -1.22 | 3.01 | -5.97 | 0 |
| 11287 | NaN | -2.56 | -0.18 | -7.19 | -1.04 | 1.38 | 1.31 | 1.56 | -2.99 | 1.27 | 3.03 | 3.69 | 0.52 | 0.75 | 2.46 | 3.19 | -4.05 | 1.52 | -2.11 | -3.49 | 0.55 | 0.76 | 1.15 | -2.13 | 0.73 | -2.17 | 5.07 | -2.04 | 1.56 | 0.86 | 3.19 | -2.53 | 0.56 | -1.15 | -0.02 | 4.07 | 0.98 | -0.57 | 0.63 | 3.92 | 0 |
| 11456 | NaN | 1.30 | 4.38 | 1.58 | -0.08 | 0.66 | -1.64 | -4.81 | -0.91 | 2.81 | 0.57 | -0.32 | 0.85 | -2.78 | -3.63 | -5.40 | -4.24 | 0.26 | 5.22 | -3.45 | -4.54 | -0.52 | -5.11 | 3.63 | -2.31 | 4.27 | -0.81 | -0.53 | 0.69 | 1.79 | 0.72 | 1.77 | 5.76 | 1.20 | 5.66 | 0.41 | -2.64 | 5.53 | 2.10 | -4.95 | 0 |
| 12221 | NaN | -2.33 | -0.05 | 0.62 | -0.90 | -2.44 | 0.35 | 2.09 | -2.93 | 2.29 | -3.84 | 6.29 | -1.58 | 0.01 | 0.55 | -1.00 | 3.33 | 1.32 | 5.20 | 3.56 | -0.65 | 2.20 | 2.73 | 4.35 | 0.56 | -4.24 | -0.25 | 2.95 | -3.26 | -0.75 | -2.26 | 0.13 | -5.18 | 5.25 | 0.72 | 3.21 | 1.64 | 1.54 | 1.81 | -2.04 | 0 |
| 12447 | NaN | 0.75 | -0.27 | 1.30 | 2.04 | -1.49 | -0.41 | 0.98 | 0.81 | -0.07 | -3.84 | -1.01 | 1.10 | 1.43 | -1.50 | 0.02 | 1.40 | 0.47 | -2.05 | 0.63 | 0.05 | 0.57 | 2.47 | 1.88 | 0.20 | 1.76 | -1.19 | -0.29 | -3.97 | -3.10 | 2.09 | 4.41 | -2.21 | -1.36 | -1.73 | 1.68 | -0.21 | -2.34 | 0.11 | -0.54 | 0 |
| 13086 | NaN | 2.06 | 3.33 | 2.74 | 2.78 | -0.44 | -2.02 | -0.89 | -1.11 | 0.03 | -2.75 | -1.15 | -1.54 | -2.02 | -2.34 | -1.39 | 1.27 | 1.22 | 0.75 | -0.92 | -0.82 | -1.87 | -2.63 | 5.16 | -1.81 | 4.43 | -5.88 | -0.43 | 0.97 | 1.19 | 3.30 | 5.11 | 4.68 | -1.71 | 2.43 | 1.00 | -1.19 | 1.21 | 0.51 | -0.88 | 0 |
| 13411 | NaN | 2.70 | 4.59 | 1.87 | 2.05 | -0.93 | -1.67 | -1.65 | -0.24 | -0.32 | -2.22 | 0.26 | 1.56 | -2.23 | -3.85 | -2.40 | -0.66 | 0.64 | 1.08 | -1.44 | -2.76 | -1.74 | -3.15 | 2.46 | -1.69 | 6.17 | -3.98 | -1.73 | 0.29 | 0.20 | 2.58 | 2.53 | 3.63 | -1.20 | 2.33 | 1.67 | -0.94 | 0.95 | 1.66 | -1.67 | 0 |
| 14202 | NaN | 7.04 | 2.14 | -3.20 | 4.11 | 3.38 | -1.34 | -4.55 | 1.94 | -5.47 | 2.36 | -1.34 | 3.05 | -4.60 | -6.04 | -4.13 | -2.80 | 4.44 | -6.63 | -8.54 | -4.27 | -0.38 | -1.14 | -0.15 | -3.12 | 11.24 | -5.05 | -5.44 | 5.03 | 2.81 | 1.92 | 0.16 | 9.77 | -10.26 | 0.51 | -1.97 | -0.03 | 3.13 | 0.01 | 4.54 | 0 |
| 15520 | NaN | 1.38 | 3.24 | -3.82 | -1.92 | 0.44 | 1.35 | -2.04 | 1.16 | 0.31 | 2.23 | 0.63 | 3.36 | -0.48 | 0.55 | -2.16 | -5.07 | -1.41 | -0.09 | -3.93 | -4.03 | 0.78 | -2.56 | -4.67 | 1.77 | 3.00 | 6.63 | -2.93 | -0.69 | -2.38 | 2.07 | -5.41 | -0.90 | -1.06 | 1.42 | 1.16 | -1.15 | -0.05 | 0.60 | 0.81 | 0 |
| 16576 | NaN | 3.93 | -0.76 | 2.65 | 1.75 | -0.55 | 1.83 | -0.11 | -3.74 | 1.04 | -0.36 | 5.86 | -4.21 | -3.35 | 1.48 | -0.45 | 2.34 | -0.38 | 6.43 | -3.53 | 0.46 | 0.97 | 2.18 | 8.72 | -2.76 | 1.92 | -4.30 | 2.85 | -0.03 | 1.12 | -1.48 | 3.49 | 1.03 | 2.85 | 1.74 | -2.00 | -0.78 | 8.70 | 0.35 | -2.01 | 0 |
| 18104 | NaN | 1.49 | 2.66 | 0.22 | -0.30 | -1.35 | 0.04 | -0.16 | 1.11 | -0.57 | -2.28 | 0.32 | 1.01 | -0.49 | -0.36 | -2.63 | 0.66 | -0.31 | 0.49 | 0.09 | -3.32 | 1.03 | -0.60 | -0.15 | 1.55 | 2.16 | 0.98 | -0.86 | -2.07 | -2.18 | 1.34 | -1.01 | -2.23 | -0.87 | 1.30 | 0.67 | -0.50 | -1.49 | -0.15 | 0.16 | 0 |
df[df['V2'].isnull()==True]
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 613 | -2.05 | NaN | -1.62 | -3.32 | 0.15 | 0.60 | -1.81 | 0.85 | -1.52 | 0.21 | -0.46 | 2.38 | 1.68 | 0.53 | -3.77 | -1.10 | -0.79 | 4.86 | -1.96 | 0.05 | -2.20 | 2.57 | 3.99 | 2.07 | -1.31 | -2.23 | -1.32 | -0.93 | 0.54 | 3.59 | -0.47 | 3.26 | 2.38 | -2.46 | 1.72 | 2.54 | 1.70 | -1.43 | 0.60 | 0.74 | 0 |
| 2236 | -3.76 | NaN | 0.19 | -1.64 | 1.26 | -1.57 | -3.69 | 1.58 | -0.31 | -0.14 | -4.50 | 1.82 | 5.03 | 1.44 | -8.11 | -2.80 | -0.19 | 5.80 | -3.03 | 2.02 | -5.08 | 3.03 | 5.20 | 3.12 | -1.58 | 0.26 | -3.54 | -2.27 | -2.47 | 2.47 | 1.16 | 7.62 | 1.70 | -3.96 | 2.71 | 4.66 | 1.62 | -5.54 | 1.25 | -1.16 | 0 |
| 2508 | -1.43 | NaN | 0.66 | -2.88 | 1.15 | -0.79 | -1.56 | 2.90 | -2.35 | -0.22 | -1.13 | 2.93 | 2.05 | 0.38 | -3.12 | 1.32 | -1.05 | 3.19 | -2.29 | -1.31 | -2.46 | 1.29 | 3.69 | 3.00 | -1.52 | 0.90 | -2.65 | -2.50 | 0.68 | 3.29 | 3.92 | 6.28 | 3.32 | -4.05 | 3.12 | 3.34 | 0.60 | -3.78 | -0.16 | 1.50 | 0 |
| 4653 | 5.47 | NaN | 4.54 | -2.92 | 0.40 | 2.80 | 0.03 | -7.33 | 1.12 | 1.70 | 1.16 | -2.78 | 0.57 | -3.08 | -1.39 | -8.51 | -6.21 | 1.40 | 0.77 | -9.15 | -6.87 | 2.07 | -4.81 | 1.90 | 0.34 | 7.16 | 4.65 | -2.62 | -1.11 | -2.28 | 3.65 | -1.54 | 4.60 | -4.10 | 4.30 | 0.15 | -3.73 | 6.56 | 0.71 | -0.46 | 0 |
| 6810 | -2.63 | NaN | 2.33 | 1.09 | 0.60 | -1.14 | -0.69 | -1.36 | 0.36 | -1.19 | -1.70 | 3.14 | 2.52 | -2.17 | -3.98 | -3.46 | 0.50 | 1.16 | 1.97 | 0.02 | -3.50 | 0.38 | -0.34 | 0.91 | -1.20 | 3.69 | -2.56 | -0.73 | -0.45 | 0.17 | -1.96 | -0.95 | 0.21 | 0.45 | 1.05 | 0.54 | 0.76 | 1.73 | 1.89 | -1.70 | 0 |
| 7788 | -4.20 | NaN | 2.95 | 0.58 | 4.10 | -0.64 | -2.81 | -0.11 | -1.36 | -0.80 | -1.39 | 0.42 | 3.81 | -1.78 | -7.55 | -1.17 | -3.18 | 2.58 | -1.86 | -5.78 | -4.96 | -0.05 | 1.94 | 6.76 | -4.83 | 9.17 | -7.40 | -4.28 | 0.95 | 3.96 | 6.19 | 12.52 | 9.50 | -7.15 | 5.67 | 1.25 | -2.16 | -0.95 | -0.00 | -1.55 | 0 |
| 8483 | -4.48 | NaN | 1.20 | -2.04 | 2.78 | -0.80 | -5.40 | -1.23 | 1.49 | -0.97 | -5.91 | -0.33 | 7.56 | 0.80 | -12.69 | -7.01 | -1.56 | 8.51 | -5.54 | 0.20 | -8.39 | 4.01 | 5.07 | 3.77 | -2.40 | 4.07 | -4.74 | -4.10 | -3.46 | 2.15 | 1.66 | 9.47 | 4.28 | -7.59 | 3.27 | 5.23 | 1.28 | -5.37 | 1.98 | -1.64 | 0 |
| 8894 | 3.26 | NaN | 8.45 | -3.25 | -3.42 | -3.00 | -0.67 | -0.16 | -0.67 | 3.13 | -2.11 | 3.73 | 5.75 | 0.33 | -1.83 | -3.28 | -5.36 | -1.13 | 3.78 | 0.58 | -7.45 | 0.40 | -4.71 | -3.82 | 2.68 | 1.78 | 7.03 | -3.36 | -3.22 | -2.71 | 4.55 | -4.24 | -3.12 | 2.52 | 5.28 | 7.29 | -0.87 | -4.32 | 3.12 | -2.39 | 0 |
| 8947 | -3.79 | NaN | 0.72 | 2.31 | 0.93 | -0.98 | 0.50 | -0.44 | -2.77 | 1.73 | -1.99 | 4.21 | -2.80 | -2.08 | 0.34 | -1.37 | 2.09 | 0.31 | 5.49 | -0.39 | 0.09 | 0.33 | 0.12 | 6.04 | -1.38 | 0.37 | -2.73 | 2.51 | -1.07 | -0.05 | -1.29 | 1.53 | -0.50 | 3.79 | 1.13 | 0.62 | -0.11 | 5.71 | 1.54 | -2.48 | 0 |
| 9362 | 2.66 | NaN | 2.98 | 4.43 | -0.24 | 0.67 | 0.38 | -7.65 | 4.43 | -0.75 | -1.17 | -3.07 | 0.03 | -3.77 | -1.93 | -10.30 | 0.34 | -1.31 | 4.46 | -2.18 | -5.36 | 1.26 | -5.03 | 0.45 | 0.70 | 6.00 | 0.91 | 1.18 | -2.53 | -4.02 | -4.61 | -5.49 | -1.10 | 1.22 | 0.98 | -4.79 | -2.27 | 7.67 | 0.82 | -3.93 | 0 |
| 9425 | -2.35 | NaN | 2.05 | 0.81 | 2.54 | -0.92 | -0.21 | -0.56 | -0.14 | -2.15 | -3.84 | 2.68 | -0.66 | -2.52 | -1.71 | -2.68 | 3.63 | 2.29 | -0.16 | -0.37 | -1.41 | 0.23 | 0.24 | 2.93 | -0.19 | 4.11 | -4.00 | -0.16 | -0.93 | -1.68 | -0.04 | -0.62 | -0.90 | -1.18 | -1.24 | 1.24 | 1.23 | 2.07 | 1.22 | 1.47 | 0 |
| 9848 | -1.76 | NaN | 2.85 | -2.75 | -0.81 | -0.10 | -1.38 | -1.11 | -0.05 | 0.16 | 0.64 | 2.04 | 4.86 | -0.35 | -4.25 | -1.56 | -3.84 | 1.64 | -0.47 | -0.33 | -3.33 | -0.35 | -1.69 | -3.14 | -0.70 | 1.79 | 1.29 | -2.78 | 0.84 | 1.25 | 0.26 | -2.16 | 1.86 | -0.34 | 1.51 | 3.41 | 0.92 | -1.50 | 2.51 | -0.79 | 0 |
| 11637 | -2.27 | NaN | 1.71 | 1.16 | -0.36 | -5.45 | -0.79 | 3.94 | -1.58 | 0.80 | -8.51 | 8.43 | 2.66 | 0.70 | -3.69 | -3.23 | 5.01 | 2.68 | 4.12 | 5.92 | -5.06 | 4.17 | 5.95 | 4.69 | 1.12 | -1.94 | -1.74 | 1.31 | -7.06 | -2.44 | -1.55 | 2.65 | -8.43 | 3.51 | 1.50 | 5.55 | 2.59 | -3.45 | 2.32 | -2.76 | 0 |
| 12339 | -1.66 | NaN | -0.71 | -4.35 | 1.39 | -0.09 | -2.16 | -0.38 | 0.03 | -0.66 | -5.65 | 2.89 | 2.21 | 0.55 | -5.22 | -5.36 | 2.14 | 8.08 | -4.13 | 1.70 | -3.91 | 4.50 | 4.89 | 2.09 | 0.98 | -1.48 | -0.36 | -0.82 | -3.84 | -1.26 | -1.12 | 0.31 | -2.69 | -3.11 | -1.60 | 5.82 | 3.46 | -1.74 | 2.29 | 2.24 | 0 |
| 15913 | 0.77 | NaN | 5.30 | 0.04 | -1.17 | -2.25 | 0.96 | -0.09 | -0.24 | -1.06 | -2.45 | 5.09 | 0.43 | -2.63 | 0.85 | -2.63 | 2.18 | -0.84 | 3.86 | 1.72 | -2.99 | -0.47 | -3.44 | -1.77 | 2.11 | 2.19 | 0.93 | -0.19 | -0.63 | -2.59 | -0.80 | -7.72 | -4.52 | 3.18 | 0.45 | 2.18 | 1.26 | 0.89 | 2.03 | 0.63 | 0 |
| 18342 | -0.93 | NaN | 2.38 | -1.24 | 3.23 | -2.10 | -2.19 | 0.59 | 1.96 | -5.01 | -7.39 | 3.31 | 3.77 | -1.84 | -7.10 | -6.07 | 4.89 | 6.48 | -4.84 | 0.97 | -6.69 | 3.47 | 4.67 | 2.43 | 0.40 | 5.75 | -5.57 | -2.88 | -2.99 | -1.46 | 0.33 | 1.61 | -1.82 | -6.66 | -0.46 | 3.05 | 2.94 | -3.79 | 0.86 | 3.34 | 0 |
| 18343 | -2.38 | NaN | -0.01 | -1.47 | 1.30 | 0.72 | -1.12 | -3.19 | 3.25 | -4.86 | -0.69 | 2.36 | 5.43 | -2.51 | -7.25 | -5.57 | 0.68 | 4.39 | -3.42 | -0.27 | -4.23 | 1.51 | 1.57 | -3.37 | -1.29 | 4.81 | -2.78 | -2.35 | 0.68 | 0.35 | -5.73 | -5.09 | 0.44 | -3.17 | -2.71 | -0.59 | 3.23 | 1.32 | 2.28 | 1.15 | 0 |
| 18907 | -0.12 | NaN | 3.66 | -1.23 | 1.95 | -0.12 | 0.65 | -1.49 | -0.03 | -2.56 | -2.09 | 2.94 | -0.49 | -3.37 | -0.24 | -2.68 | 1.93 | 1.65 | -0.60 | -2.33 | -1.78 | -0.47 | -2.09 | 0.33 | 0.67 | 5.42 | -1.58 | -1.35 | 0.40 | -2.33 | 0.96 | -4.67 | -0.59 | -1.65 | -1.41 | 1.53 | 1.08 | 2.83 | 1.45 | 3.23 | 0 |
df.describe()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19982.00 | 19982.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 | 20000.00 |
| mean | -0.27 | 0.44 | 2.48 | -0.08 | -0.05 | -1.00 | -0.88 | -0.55 | -0.02 | -0.01 | -1.90 | 1.60 | 1.58 | -0.95 | -2.41 | -2.93 | -0.13 | 1.19 | 1.18 | 0.02 | -3.61 | 0.95 | -0.37 | 1.13 | -0.00 | 1.87 | -0.61 | -0.88 | -0.99 | -0.02 | 0.49 | 0.30 | 0.05 | -0.46 | 2.23 | 1.51 | 0.01 | -0.34 | 0.89 | -0.88 | 0.06 |
| std | 3.44 | 3.15 | 3.39 | 3.43 | 2.10 | 2.04 | 1.76 | 3.30 | 2.16 | 2.19 | 3.12 | 2.93 | 2.87 | 1.79 | 3.35 | 4.22 | 3.35 | 2.59 | 3.40 | 3.67 | 3.57 | 1.65 | 4.03 | 3.91 | 2.02 | 3.44 | 4.37 | 1.92 | 2.68 | 3.01 | 3.46 | 5.50 | 3.58 | 3.18 | 2.94 | 3.80 | 1.79 | 3.95 | 1.75 | 3.01 | 0.23 |
| min | -11.88 | -12.32 | -10.71 | -15.08 | -8.60 | -10.23 | -7.95 | -15.66 | -8.60 | -9.85 | -14.83 | -12.95 | -13.23 | -7.74 | -16.42 | -20.37 | -14.09 | -11.64 | -13.49 | -13.92 | -17.96 | -10.12 | -14.87 | -16.39 | -8.23 | -11.83 | -14.90 | -9.27 | -12.58 | -14.80 | -13.72 | -19.88 | -16.90 | -17.99 | -15.35 | -14.83 | -5.48 | -17.38 | -6.44 | -11.02 | 0.00 |
| 25% | -2.74 | -1.64 | 0.21 | -2.35 | -1.54 | -2.35 | -2.03 | -2.64 | -1.49 | -1.41 | -3.92 | -0.40 | -0.22 | -2.17 | -4.42 | -5.63 | -2.22 | -0.40 | -1.05 | -2.43 | -5.93 | -0.12 | -3.10 | -1.47 | -1.37 | -0.34 | -3.65 | -2.17 | -2.79 | -1.87 | -1.82 | -3.42 | -2.24 | -2.14 | 0.34 | -0.94 | -1.26 | -2.99 | -0.27 | -2.94 | 0.00 |
| 50% | -0.75 | 0.47 | 2.26 | -0.14 | -0.10 | -1.00 | -0.92 | -0.39 | -0.07 | 0.10 | -1.92 | 1.51 | 1.64 | -0.96 | -2.38 | -2.68 | -0.01 | 0.88 | 1.28 | 0.03 | -3.53 | 0.97 | -0.26 | 0.97 | 0.03 | 1.95 | -0.88 | -0.89 | -1.18 | 0.18 | 0.49 | 0.05 | -0.07 | -0.26 | 2.10 | 1.57 | -0.13 | -0.32 | 0.92 | -0.92 | 0.00 |
| 75% | 1.84 | 2.54 | 4.57 | 2.13 | 1.34 | 0.38 | 0.22 | 1.72 | 1.41 | 1.48 | 0.12 | 3.57 | 3.46 | 0.27 | -0.36 | -0.10 | 2.07 | 2.57 | 3.49 | 2.51 | -1.27 | 2.03 | 2.45 | 3.55 | 1.40 | 4.13 | 2.19 | 0.38 | 0.63 | 2.04 | 2.73 | 3.76 | 2.26 | 1.44 | 4.06 | 3.98 | 1.18 | 2.28 | 2.06 | 1.12 | 0.00 |
| max | 15.49 | 13.09 | 17.09 | 13.24 | 8.13 | 6.98 | 8.01 | 11.68 | 8.14 | 8.11 | 11.83 | 15.08 | 15.42 | 5.67 | 12.25 | 13.58 | 16.76 | 13.18 | 13.24 | 16.05 | 13.84 | 7.41 | 14.46 | 17.16 | 8.22 | 16.84 | 17.56 | 6.53 | 10.72 | 12.51 | 17.26 | 23.63 | 16.69 | 14.36 | 15.29 | 19.33 | 7.47 | 15.29 | 7.76 | 10.65 | 1.00 |
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
for feature in df.columns:
histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data
As observed earlier, all the features except the target feature appears to be normally distribution with some having very slight skewness
The target variable is categorical with the data highly imbalanced
df_1 = df.iloc[:,0:11]
df_1['Target'] = df['Target']
df_1
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4.46 | -4.68 | 3.10 | 0.51 | -0.22 | -2.03 | -2.91 | 0.05 | -1.52 | 3.76 | -5.71 | 0 |
| 1 | 3.37 | 3.65 | 0.91 | -1.37 | 0.33 | 2.36 | 0.73 | -4.33 | 0.57 | -0.10 | 1.91 | 0 |
| 2 | -3.83 | -5.82 | 0.63 | -2.42 | -1.77 | 1.02 | -2.10 | -3.17 | -2.08 | 5.39 | -0.77 | 0 |
| 3 | 1.62 | 1.89 | 7.05 | -1.15 | 0.08 | -1.53 | 0.21 | -2.49 | 0.34 | 2.12 | -3.05 | 0 |
| 4 | -0.11 | 3.87 | -3.76 | -2.98 | 3.79 | 0.54 | 0.21 | 4.85 | -1.85 | -6.22 | 2.00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | -2.07 | -1.09 | -0.80 | -3.01 | -2.29 | 2.81 | 0.48 | 0.11 | -0.59 | -2.90 | 8.87 | 1 |
| 19996 | 2.89 | 2.48 | 5.64 | 0.94 | -1.38 | 0.41 | -1.59 | -5.76 | 2.15 | 0.27 | -2.09 | 0 |
| 19997 | -3.90 | -3.94 | -0.35 | -2.42 | 1.11 | -1.53 | -3.52 | 2.05 | -0.23 | -0.36 | -3.78 | 0 |
| 19998 | -3.19 | -10.05 | 5.70 | -4.37 | -5.35 | -1.87 | -3.95 | 0.68 | -2.39 | 5.46 | 1.58 | 0 |
| 19999 | -2.69 | 1.96 | 6.14 | 2.60 | 2.66 | -4.29 | -2.34 | 0.97 | -1.03 | 0.50 | -9.59 | 0 |
20000 rows × 12 columns
sns.pairplot(df_1, hue = 'Target' , diag_kind='hist')
<seaborn.axisgrid.PairGrid at 0x7fa73e8d1d10>
df_2 = df.iloc[:,11:20]
df_2['Target'] = df['Target']
df_3 = df.iloc[:,20:30]
df_3['Target'] = df['Target']
df_4 = df.iloc[:,30:40]
df_4['Target'] = df['Target']
sns.pairplot(df_2, hue = 'Target' , diag_kind='hist')
<seaborn.axisgrid.PairGrid at 0x7fa73a7267d0>
sns.pairplot(df_3, hue = 'Target' , diag_kind='hist')
<seaborn.axisgrid.PairGrid at 0x7fa745057810>
sns.pairplot(df_4, hue = 'Target' , diag_kind='hist')
<seaborn.axisgrid.PairGrid at 0x7fa737d48d10>
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target = 'Target'
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
df['Target']=df['Target'].astype('category')
feature_columns = df.columns
feature_columns = feature_columns.drop('Target')
for predictor in feature_columns:
distribution_plot_wrt_target(df, predictor)
# outlier detection using boxplot
num_cols = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 8))
for i, variable in enumerate(num_cols):
plt.subplot(7, 7, i + 1)
sns.boxplot(data=df, x=variable)
plt.tight_layout(pad=2)
plt.show()
# selected features for outlier analysis
sel_cols = ['V2','V12','V14','V15','V22','V30','V31']
plt.figure(figsize=(15, 8))
for i, variable in enumerate(sel_cols):
plt.subplot(3, 3, i + 1)
sns.boxplot(data=df, x=variable)
plt.tight_layout(pad=2)
plt.show()
# separating the independent and dependent variables
X = df.drop(["Target"], axis=1)
y = df["Target"]
# creating dummy variables
#X = pd.get_dummies(X, drop_first=True)
# Splitting data into training, validation and test set:
# Splitting data into 2 parts, temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# Splitting the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.2, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(12800, 40) (3200, 40) (4000, 40)
df.isnull().sum()
V1 18 V2 18 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
# Let's impute the missing values
imp_median = KNNImputer(n_neighbors=5)
# fit the imputer on train data and transform the train data
X_train["V1"] = imp_median.fit_transform(X_train[["V1"]])
X_train["V2"] = imp_median.fit_transform(X_train[["V2"]])
# transform the validation and test data using the imputer fit on train data
X_val["V1"] = imp_median.transform(X_val[["V1"]])
X_val["V2"] = imp_median.transform(X_val[["V2"]])
X_test["V1"] = imp_median.transform(X_test[["V1"]])
X_test["V2"] = imp_median.transform(X_test[["V2"]])
# Checking class balance for whole data, train set, validation set, and test set
print("Target value ratio in y")
print(y.value_counts(1))
print("*" * 80)
print("Target value ratio in y_train")
print(y_train.value_counts(1))
print("*" * 80)
print("Target value ratio in y_val")
print(y_val.value_counts(1))
print("*" * 80)
print("Target value ratio in y_test")
print(y_test.value_counts(1))
print("*" * 80)
Target value ratio in y 0 0.94 1 0.06 Name: Target, dtype: float64 ******************************************************************************** Target value ratio in y_train 0 0.94 1 0.06 Name: Target, dtype: float64 ******************************************************************************** Target value ratio in y_val 0 0.94 1 0.06 Name: Target, dtype: float64 ******************************************************************************** Target value ratio in y_test 0 0.94 1 0.06 Name: Target, dtype: float64 ********************************************************************************
The nature of predictions made by the classification model will translate as follows:
Which metric to optimize?
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1
},
index=[0],
)
return df_perf
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
Sample Decision Tree model building with original data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.7183098591549295 Validation Performance: dtree: 0.7359550561797753
# to check performance of the model on training data
dtree_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
dtree_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# to check performance of the model on validat
dtree_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
dtree_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.74 | 0.75 | 0.74 |
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 710 Before OverSampling, count of label '0': 12090 After OverSampling, count of label '1': 12090 After OverSampling, count of label '0': 12090 After OverSampling, the shape of train_X: (24180, 40) After OverSampling, the shape of train_y: (24180,)
dtree1 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with oversampled training set
dtree1.fit(X_train_over, y_train_over)
DecisionTreeClassifier(max_depth=4, random_state=1)
# Predicting the target for train and validation set
pred_train = dtree1.predict(X_train_over)
pred_val = dtree1.predict(X_val)
# to check performance of the model
dtree_oversampled_model_train_perf = model_performance_classification_sklearn(
dtree1, X_train, y_train
)
dtree_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.85 | 0.38 | 0.53 |
# to check performance of the model
dtree_oversampled_model_val_perf = model_performance_classification_sklearn(
dtree1, X_val, y_val
)
dtree_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.80 | 0.41 | 0.54 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, pred_val)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, count of label '1': 710 Before Under Sampling, count of label '0': 12090 After Under Sampling, count of label '1': 710 After Under Sampling, count of label '0': 710 After Under Sampling, the shape of train_X: (1420, 40) After Under Sampling, the shape of train_y: (1420,)
dtree2 = DecisionTreeClassifier(random_state=1, max_depth=4)
# training the decision tree model with oversampled training set
dtree2.fit(X_train_un, y_train_un)
DecisionTreeClassifier(max_depth=4, random_state=1)
# Predicting the target for train and validation set
pred_train = dtree2.predict(X_train_un)
pred_val = dtree2.predict(X_val)
# to check performance of the model
dtree_undersampled_model_train_perf = model_performance_classification_sklearn(
dtree2, X_train, y_train
)
dtree_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.91 | 0.86 | 0.38 | 0.52 |
# to check performance of the model
dtree_undersampled_model_val_perf = model_performance_classification_sklearn(
dtree2, X_val, y_val
)
dtree_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.85 | 0.39 | 0.53 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_un, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# Confusion matrix for validation data
cm = confusion_matrix(y_val, pred_val)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.
param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }
param_grid = {'C': np.arange(0.1,1.1,0.1)}
param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7, 10],
'max_leaf_nodes' : [10,15,20],
'min_impurity_decrease': [0.0001,0.001,0.01] }
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score=0.4999999999999999:
# Set the clf to the best combination of parameters
dt1_tuned = DecisionTreeClassifier(
min_samples_leaf=1,
max_leaf_nodes=10,
max_depth=5,
min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
dt1_tuned.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
min_impurity_decrease=0.001)
# to check performance of the model
dtree_tuned_default_model_train_perf = model_performance_classification_sklearn(
dt1_tuned, X_train, y_train
)
dtree_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.56 | 0.90 | 0.69 |
# to check performance of the model
dtree_tuned_default_model_val_perf = model_performance_classification_sklearn(
dt1_tuned, X_val, y_val
)
dtree_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.57 | 0.85 | 0.68 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train, dt1_tuned.predict(X_train))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7,10],
'max_leaf_nodes' : [10,15,20],
'min_impurity_decrease': [0.0001,0.001,0.01] }
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 4, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 10, 'max_depth': 2} with CV score=0.9129859387923904:
# Set the clf to the best combination of parameters
dt2_tuned = DecisionTreeClassifier(
min_samples_leaf=1,
max_leaf_nodes=15,
max_depth=3,
min_impurity_decrease=0.0001,
)
# Fit the best algorithm to the data.
dt2_tuned.fit(X_train_over, y_train_over)
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=15,
min_impurity_decrease=0.0001)
# to check performance of the model
dtree_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
dt2_tuned, X_train, y_train
)
dtree_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.90 | 0.84 | 0.33 | 0.48 |
# to check performance of the model
dtree_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
dt2_tuned, X_val, y_val
)
dtree_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.90 | 0.82 | 0.34 | 0.49 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, dt2_tuned.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# Confusion matrix for validation data
cm = confusion_matrix(y_val, dt2_tuned.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
'min_samples_leaf': [1, 2, 5, 7],
'max_leaf_nodes' : [5, 10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 11} with CV score=0.8422535211267606:
# Set the clf to the best combination of parameters
dt3_tuned = DecisionTreeClassifier(
min_samples_leaf=1,
max_leaf_nodes=15,
max_depth=11,
min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
dt3_tuned.fit(X_train_un, y_train_un)
DecisionTreeClassifier(max_depth=11, max_leaf_nodes=15,
min_impurity_decrease=0.001)
# to check performance of the model
dtree_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
dt3_tuned, X_train, y_train
)
dtree_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.86 | 0.39 | 0.54 |
# to check performance of the model
dtree_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
dt3_tuned, X_val, y_val
)
dtree_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.85 | 0.41 | 0.55 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_un, dt3_tuned.predict(X_train_un))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# Confusion matrix for validation data
cm = confusion_matrix(y_val, dt3_tuned.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
[
dtree_default_model_train_perf.T,
dtree_oversampled_model_train_perf.T,
dtree_undersampled_model_train_perf.T,
dtree_tuned_default_model_train_perf.T,
dtree_tuned_oversampled_model_train_perf.T,
dtree_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default Decision Tree",
"Decision Tree with oversampled data",
"Decision Tree with undersampled data",
"Tuned Default Decision Tree",
"Tuned Decision Tree with oversampled data",
"Tuned Decision Tree with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default Decision Tree | Decision Tree with oversampled data | Decision Tree with undersampled data | Tuned Default Decision Tree | Tuned Decision Tree with oversampled data | Tuned Decision Tree with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 1.00 | 0.92 | 0.91 | 0.97 | 0.90 | 0.92 |
| Recall | 1.00 | 0.85 | 0.86 | 0.56 | 0.84 | 0.86 |
| Precision | 1.00 | 0.38 | 0.38 | 0.90 | 0.33 | 0.39 |
| F1 | 1.00 | 0.53 | 0.52 | 0.69 | 0.48 | 0.54 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
dtree_default_model_val_perf.T,
dtree_oversampled_model_val_perf.T,
dtree_undersampled_model_val_perf.T,
dtree_tuned_default_model_val_perf.T,
dtree_tuned_oversampled_model_val_perf.T,
dtree_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default Decision Tree",
"Decision Tree with oversampled data",
"Decision Tree with undersampled data",
"Tuned Default Decision Tree",
"Tuned Decision Tree with oversampled data",
"Tuned Decision Tree with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default Decision Tree | Decision Tree with oversampled data | Decision Tree with undersampled data | Tuned Default Decision Tree | Tuned Decision Tree with oversampled data | Tuned Decision Tree with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.97 | 0.92 | 0.92 | 0.97 | 0.90 | 0.92 |
| Recall | 0.74 | 0.80 | 0.85 | 0.57 | 0.82 | 0.85 |
| Precision | 0.75 | 0.41 | 0.39 | 0.85 | 0.34 | 0.41 |
| F1 | 0.74 | 0.54 | 0.53 | 0.68 | 0.49 | 0.55 |
Decision Tree with Undersampled Data and Tuned Decision Tree with Undersampled Data. Decision Tree with Undersampled Data will be selected for the Decision Tree models because it is simpler# to check performance of the model on the test data
dtree_undersampled_model_test_perf = model_performance_classification_sklearn(
dtree2, X_test, y_test
)
dtree_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.91 | 0.84 | 0.35 | 0.50 |
Decision Tree with Undersampled Data model has recall performance on the test data of 84% which is comparable to the model's recall performance on the validation data (85%)# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, dtree2.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(dtree2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V18 0.45 V39 0.21 V3 0.13 V26 0.09 V10 0.03 V9 0.03 V11 0.02 V13 0.02 V12 0.02 V14 0.01 V38 0.00 V35 0.00 V36 0.00 V37 0.00 V25 0.00 V34 0.00 V33 0.00 V32 0.00 V31 0.00 V30 0.00 V29 0.00 V28 0.00 V27 0.00 V1 0.00 V21 0.00 V24 0.00 V23 0.00 V22 0.00 V2 0.00 V20 0.00 V19 0.00 V17 0.00 V16 0.00 V15 0.00 V8 0.00 V7 0.00 V6 0.00 V5 0.00 V4 0.00 V40 0.00
feature_names = X_train.columns
importances = dtree2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", RandomForestClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.7084507042253521 Validation Performance: dtree: 0.7078651685393258
# to check performance of the model on training data
rf_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
rf_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# to check performance of the model on validation data
rf_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
rf_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.71 | 0.98 | 0.82 |
rf1 = RandomForestClassifier(random_state=1)
# training the random forest model with oversampled training set
rf1.fit(X_train_over, y_train_over)
RandomForestClassifier(random_state=1)
# to check performance of the model
rf_oversampled_model_train_perf = model_performance_classification_sklearn(
rf1, X_train, y_train
)
rf_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# to check performance of the model
rf_oversampled_model_val_perf = model_performance_classification_sklearn(
rf1, X_val, y_val
)
rf_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.85 | 0.96 | 0.90 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, rf1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, rf1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
rf2 = RandomForestClassifier(random_state=1)
# training the random forest model with unersampled training set
rf2.fit(X_train_un, y_train_un)
RandomForestClassifier(random_state=1)
# to check performance of the model
rf_undersampled_model_train_perf = model_performance_classification_sklearn(
rf2, X_train, y_train
)
rf_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 1.00 | 0.60 | 0.75 |
# to check performance of the model
rf_undersampled_model_val_perf = model_performance_classification_sklearn(
rf2, X_val, y_val
)
rf_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.92 | 0.63 | 0.75 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_un, rf2.predict(X_train_un))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for undersampled validation data
cm = confusion_matrix(y_val, rf2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [200,250,300],
'min_samples_leaf': [1, 4],
'max_features' : [np.arange(0.3, 0.6, 0.1),'sqrt'],
'max_samples': np.arange(0.4, 0.7, 0.1) }
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.6901408450704226:
# Set the clf to the best combination of parameters
rf1_tuned = RandomForestClassifier(
n_estimators=200,
min_samples_leaf=1,
max_features='sqrt',
max_samples=0.6,
)
# Fit the best algorithm to the data.
rf1_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=200)
# to check performance of the model on the training data
rf1_tuned_default_model_train_perf = model_performance_classification_sklearn(
rf1_tuned, X_train, y_train
)
rf1_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.90 | 1.00 | 0.95 |
# to check performance of the model on the validation data
rf1_tuned_default_model_val_perf = model_performance_classification_sklearn(
rf1_tuned, X_val, y_val
)
rf1_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.70 | 0.98 | 0.82 |
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [200,250,300],
'min_samples_leaf': [1, 4],
'max_features' : [np.arange(0.3, 0.6, 0.1),'sqrt'],
'max_samples': np.arange(0.4, 0.7, 0.1) }
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9812241521918942:
# Set the clf to the best combination of parameters
rf2_tuned = RandomForestClassifier(
n_estimators=200,
min_samples_leaf=1,
max_features='sqrt',
max_samples=0.6,
)
# Fit the best algorithm to the data.
rf2_tuned.fit(X_train_over, y_train_over)
RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=200)
# to check performance of the model on the training data
rf2_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
rf2_tuned, X_train, y_train
)
rf2_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# to check performance of the model on the validation data
rf2_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
rf2_tuned, X_val, y_val
)
rf2_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.87 | 0.96 | 0.91 |
# defining model
Model = RandomForestClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [200,250,300],
'min_samples_leaf': [1, 4],
'max_features' : [np.arange(0.3, 0.6, 0.1),'sqrt'],
'max_samples': np.arange(0.4, 0.7, 0.1) }
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.8774647887323944:
# Set the clf to the best combination of parameters
rf3_tuned = RandomForestClassifier(
n_estimators=200,
min_samples_leaf=1,
max_features='sqrt',
max_samples=0.6,
)
# Fit the best algorithm to the data.
rf3_tuned.fit(X_train_un, y_train_un)
RandomForestClassifier(max_features='sqrt', max_samples=0.6, n_estimators=200)
# to check performance of the model on the training data
rf3_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
rf3_tuned, X_train, y_train
)
rf3_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.98 | 0.58 | 0.73 |
# to check performance of the model on the validation data
rf3_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
rf3_tuned, X_val, y_val
)
rf3_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.92 | 0.59 | 0.71 |
# training performance comparison
models_train_comp_df = pd.concat(
[
rf_default_model_train_perf.T,
rf_oversampled_model_train_perf.T,
rf_undersampled_model_train_perf.T,
rf1_tuned_default_model_train_perf.T,
rf2_tuned_oversampled_model_train_perf.T,
rf3_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default Random Forest",
"Random Forest with oversampled data",
"Random Forest with undersampled data",
"Tuned Default Random Forest",
"Tuned Random Forest with oversampled data",
"Tuned Random Forest with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default Random Forest | Random Forest with oversampled data | Random Forest with undersampled data | Tuned Default Random Forest | Tuned Random Forest with oversampled data | Tuned Random Forest with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 1.00 | 1.00 | 0.96 | 0.99 | 1.00 | 0.96 |
| Recall | 1.00 | 1.00 | 1.00 | 0.90 | 1.00 | 0.98 |
| Precision | 1.00 | 1.00 | 0.60 | 1.00 | 1.00 | 0.58 |
| F1 | 1.00 | 1.00 | 0.75 | 0.95 | 1.00 | 0.73 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
rf_default_model_val_perf.T,
rf_oversampled_model_val_perf.T,
rf_undersampled_model_val_perf.T,
rf1_tuned_default_model_val_perf.T,
rf2_tuned_oversampled_model_val_perf.T,
rf3_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default Random Forest",
"Random Forest with oversampled data",
"Random Forest with undersampled data",
"Tuned Default Random Forest",
"Tuned Random Forest with oversampled data",
"Tuned Random Forest with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default Random Forest | Random Forest with oversampled data | Random Forest with undersampled data | Tuned Default Random Forest | Tuned Random Forest with oversampled data | Tuned Random Forest with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.98 | 0.99 | 0.97 | 0.98 | 0.99 | 0.96 |
| Recall | 0.71 | 0.85 | 0.92 | 0.70 | 0.87 | 0.92 |
| Precision | 0.98 | 0.96 | 0.63 | 0.98 | 0.96 | 0.59 |
| F1 | 0.82 | 0.90 | 0.75 | 0.82 | 0.91 | 0.71 |
Random Forest with Undersampled Data and Tuned Random Forest with Undersampled Data. Tuned Random Forest with Undersampled Data will be selected for the Decision Tree models because it generalizes better# to check performance of the model on the test data
rf3_tuned_undersampled_model_test_perf = model_performance_classification_sklearn(
rf3_tuned, X_test, y_test
)
rf3_tuned_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.87 | 0.52 | 0.65 |
X_test data shows that the models is overfitting and may be unreliable# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, rf3_tuned.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(rf3_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V36 0.09 V18 0.09 V39 0.06 V15 0.04 V26 0.04 V16 0.04 V21 0.03 V7 0.03 V14 0.03 V28 0.03 V11 0.03 V3 0.03 V12 0.03 V9 0.03 V34 0.02 V13 0.02 V35 0.02 V5 0.02 V4 0.02 V37 0.02 V20 0.02 V31 0.02 V38 0.02 V24 0.02 V2 0.02 V40 0.02 V30 0.01 V33 0.01 V19 0.01 V10 0.01 V6 0.01 V25 0.01 V8 0.01 V1 0.01 V27 0.01 V17 0.01 V22 0.01 V23 0.01 V29 0.01 V32 0.01
feature_names = X_train.columns
importances = rf3_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", LogisticRegression(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.476056338028169 Validation Performance: dtree: 0.5
# to check performance of the model on training data
lgr_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
lgr_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.48 | 0.86 | 0.62 |
# to check performance of the model on validation data
lgr_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
lgr_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.50 | 0.83 | 0.62 |
lgr1 = LogisticRegression(random_state=1)
# training the logistic regression model with oversampled training set
lgr1.fit(X_train_over, y_train_over)
LogisticRegression(random_state=1)
# to check performance of the model training data
lgr_oversampled_model_train_perf = model_performance_classification_sklearn(
lgr1, X_train, y_train
)
lgr_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.85 | 0.28 | 0.42 |
# to check performance of the model on the validation data
lgr_oversampled_model_val_perf = model_performance_classification_sklearn(
lgr1, X_val, y_val
)
lgr_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.89 | 0.29 | 0.44 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, lgr1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, lgr1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
lgr2 = LogisticRegression(random_state=1)
# training the logistic regression model with undersampled training set
lgr2.fit(X_train_un, y_train_un)
LogisticRegression(random_state=1)
# to check performance of the model on the training data
lgr_undersampled_model_train_perf = model_performance_classification_sklearn(
lgr2, X_train, y_train
)
lgr_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.85 | 0.27 | 0.41 |
# to check performance of the model on the validation data
lgr_undersampled_model_val_perf = model_performance_classification_sklearn(
lgr2, X_val, y_val
)
lgr_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.89 | 0.29 | 0.44 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_over, lgr2.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, lgr2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
For Logistic Regression: param_grid = {'C': np.arange(0.1,1.1,0.1)}
# defining model
Model = LogisticRegression(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'C': np.arange(0.1,1.1,0.1)}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'C': 0.1} with CV score=0.476056338028169:
# Set the clf to the best combination of parameters
lgr1_tuned = LogisticRegression(
C=0.1,
)
# Fit the best algorithm to the data.
lgr1_tuned.fit(X_train, y_train)
LogisticRegression(C=0.1)
# to check performance of the model on the training data
lgr_tuned_default_model_train_perf = model_performance_classification_sklearn(
lgr1_tuned, X_train, y_train
)
lgr_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.48 | 0.87 | 0.62 |
# to check performance of the model on the validation data
lgr_tuned_default_model_val_perf = model_performance_classification_sklearn(
lgr1_tuned, X_val, y_val
)
lgr_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.50 | 0.83 | 0.62 |
# defining model
Model = LogisticRegression(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'C': np.arange(0.1,1.1,0.1)}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'C': 0.1} with CV score=0.8738626964433417:
# Set the clf to the best combination of parameters
lgr2_tuned = LogisticRegression(
C=0.1,
)
# Fit the best algorithm to the data.
lgr2_tuned.fit(X_train_over, y_train_over)
LogisticRegression(C=0.1)
# to check performance of the model on the training data
lgr_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
lgr2_tuned, X_train, y_train
)
lgr_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.85 | 0.28 | 0.42 |
# to check performance of the model on the validation data
lgr_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
lgr2_tuned, X_val, y_val
)
lgr_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.89 | 0.29 | 0.44 |
# defining model
Model = LogisticRegression(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'C': np.arange(0.1,1.1,0.1)}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'C': 0.1} with CV score=0.847887323943662:
# Set the clf to the best combination of parameters
lgr3_tuned = LogisticRegression(
C=0.1,
)
# Fit the best algorithm to the data.
lgr3_tuned.fit(X_train_un, y_train_un)
LogisticRegression(C=0.1)
# to check performance of the model on the training data
lgr_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
lgr3_tuned, X_train, y_train
)
lgr_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.85 | 0.27 | 0.41 |
# to check performance of the model on the validation data
lgr_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
lgr3_tuned, X_val, y_val
)
lgr_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.87 | 0.89 | 0.29 | 0.44 |
# training performance comparison
models_train_comp_df = pd.concat(
[
lgr_default_model_train_perf.T,
lgr_oversampled_model_train_perf.T,
lgr_undersampled_model_train_perf.T,
lgr_tuned_default_model_train_perf.T,
lgr_tuned_oversampled_model_train_perf.T,
lgr_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default Logistic Regression",
"Logistic Regression with oversampled data",
"Logistic Regression with undersampled data",
"Tuned Default Logistic Regression",
"Tuned Logistic Regression with oversampled data",
"Tuned Logistic Regression with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default Logistic Regression | Logistic Regression with oversampled data | Logistic Regression with undersampled data | Tuned Default Logistic Regression | Tuned Logistic Regression with oversampled data | Tuned Logistic Regression with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.97 | 0.87 | 0.87 | 0.97 | 0.87 | 0.87 |
| Recall | 0.48 | 0.85 | 0.85 | 0.48 | 0.85 | 0.85 |
| Precision | 0.86 | 0.28 | 0.27 | 0.87 | 0.28 | 0.27 |
| F1 | 0.62 | 0.42 | 0.41 | 0.62 | 0.42 | 0.41 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
lgr_default_model_val_perf.T,
lgr_oversampled_model_val_perf.T,
lgr_undersampled_model_val_perf.T,
lgr_tuned_default_model_val_perf.T,
lgr_tuned_oversampled_model_val_perf.T,
lgr_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default Logistic Regression",
"Logistic Regression with oversampled data",
"Logistic Regression with undersampled data",
"Tuned Default Logistic Regression",
"Tuned Logistic Regression with oversampled data",
"Tuned Logistic Regression with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default Logistic Regression | Logistic Regression with oversampled data | Logistic Regression with undersampled data | Tuned Default Logistic Regression | Tuned Logistic Regression with oversampled data | Tuned Logistic Regression with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.97 | 0.87 | 0.87 | 0.97 | 0.87 | 0.87 |
| Recall | 0.50 | 0.89 | 0.89 | 0.50 | 0.89 | 0.89 |
| Precision | 0.83 | 0.29 | 0.29 | 0.83 | 0.29 | 0.29 |
| F1 | 0.62 | 0.44 | 0.44 | 0.62 | 0.44 | 0.44 |
Logistic Regression model with undersampled data will be selected as the logistic regression model# to check performance of the model on the test data
lgr_undersampled_model_test_perf = model_performance_classification_sklearn(
lgr2, X_test, y_test
)
lgr_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.86 | 0.86 | 0.26 | 0.41 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, lgr2.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", AdaBoostClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.5985915492957747 Validation Performance: dtree: 0.6460674157303371
# to check performance of the model on training data
adb_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
adb_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.63 | 0.90 | 0.74 |
# to check performance of the model on validation data
adb_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
adb_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.65 | 0.90 | 0.75 |
adb1 = AdaBoostClassifier(random_state=1)
# training the logistic regression model with oversampled training set
adb1.fit(X_train_over, y_train_over)
AdaBoostClassifier(random_state=1)
# to check performance of the model training data
adb_oversampled_model_train_perf = model_performance_classification_sklearn(
adb1, X_train, y_train
)
adb_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.87 | 0.40 | 0.55 |
# to check performance of the model on the validation data
adb_oversampled_model_val_perf = model_performance_classification_sklearn(
adb1, X_val, y_val
)
adb_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.92 | 0.86 | 0.39 | 0.54 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, adb1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, adb1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
adb2 = AdaBoostClassifier(random_state=1)
# training the logistic regression model with undersampled training set
adb2.fit(X_train_un, y_train_un)
AdaBoostClassifier(random_state=1)
# to check performance of the model training data
adb_undersampled_model_train_perf = model_performance_classification_sklearn(
adb2, X_train, y_train
)
adb_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.89 | 0.90 | 0.32 | 0.47 |
# to check performance of the model on the validation data
adb_undersampled_model_val_perf = model_performance_classification_sklearn(
adb2, X_val, y_val
)
adb_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.89 | 0.89 | 0.33 | 0.48 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_over, adb2.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, adb2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
For Adaboost: param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,200],
'learning_rate': [0.2, 0.05],
'base_estimator' : [DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ]
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.7492957746478874:
# Set the clf to the best combination of parameters
adb1_tuned = AdaBoostClassifier(
n_estimators=200,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the best algorithm to the data.
adb1_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=200)
# to check performance of the model on the training data
adb_tuned_default_model_train_perf = model_performance_classification_sklearn(
adb1_tuned, X_train, y_train
)
adb_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
# to check performance of the model on the validation data
adb_tuned_default_model_val_perf = model_performance_classification_sklearn(
adb1_tuned, X_val, y_val
)
adb_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.78 | 0.99 | 0.87 |
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,200],
'learning_rate': [0.2, 0.05],
'base_estimator' : [DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ]
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9772539288668322:
# Set the clf to the best combination of parameters
adb2_tuned = AdaBoostClassifier(
n_estimators=200,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the best algorithm to the data.
adb2_tuned.fit(X_train_over, y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=200)
# to check performance of the model on the training data
adb_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
adb2_tuned, X_train, y_train
)
adb_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 0.94 | 0.97 |
# to check performance of the model on the validation data
adb_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
adb2_tuned, X_val, y_val
)
adb_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.87 | 0.88 | 0.87 |
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,200],
'learning_rate': [0.2, 0.05],
'base_estimator' : [DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ]
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 200, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8788732394366198:
# Set the clf to the best combination of parameters
adb3_tuned = AdaBoostClassifier(
n_estimators=200,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the best algorithm to the data.
adb3_tuned.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=200)
# to check performance of the model on the training data
adb_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
adb3_tuned, X_train, y_train
)
adb_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.94 | 1.00 | 0.48 | 0.64 |
# to check performance of the model on the validation data
adb_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
adb3_tuned, X_val, y_val
)
adb_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93 | 0.90 | 0.44 | 0.59 |
# training performance comparison
models_train_comp_df = pd.concat(
[
adb_default_model_train_perf.T,
adb_oversampled_model_train_perf.T,
adb_undersampled_model_train_perf.T,
adb_tuned_default_model_train_perf.T,
adb_tuned_oversampled_model_train_perf.T,
adb_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default AdaBoost",
"AdaBoost with oversampled data",
"AdaBoost with undersampled data",
"Tuned Default AdaBoost",
"Tuned AdaBoost with oversampled data",
"Tuned AdaBoost with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default AdaBoost | AdaBoost with oversampled data | AdaBoost with undersampled data | Tuned Default AdaBoost | Tuned AdaBoost with oversampled data | Tuned AdaBoost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.98 | 0.92 | 0.89 | 1.00 | 1.00 | 0.94 |
| Recall | 0.63 | 0.87 | 0.90 | 1.00 | 1.00 | 1.00 |
| Precision | 0.90 | 0.40 | 0.32 | 1.00 | 0.94 | 0.48 |
| F1 | 0.74 | 0.55 | 0.47 | 1.00 | 0.97 | 0.64 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
adb_default_model_val_perf.T,
adb_oversampled_model_val_perf.T,
adb_undersampled_model_val_perf.T,
adb_tuned_default_model_val_perf.T,
adb_tuned_oversampled_model_val_perf.T,
adb_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default AdaBoost",
"AdaBoost with oversampled data",
"AdaBoost with undersampled data",
"Tuned Default AdaBoost",
"Tuned AdaBoost with oversampled data",
"Tuned AdaBoost with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default AdaBoost | AdaBoost with oversampled data | AdaBoost with undersampled data | Tuned Default AdaBoost | Tuned AdaBoost with oversampled data | Tuned AdaBoost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.98 | 0.92 | 0.89 | 0.99 | 0.99 | 0.93 |
| Recall | 0.65 | 0.86 | 0.89 | 0.78 | 0.87 | 0.90 |
| Precision | 0.90 | 0.39 | 0.33 | 0.99 | 0.88 | 0.44 |
| F1 | 0.75 | 0.54 | 0.48 | 0.87 | 0.87 | 0.59 |
Tuned AdaBoost with undersampled data and AdaBoost with undersampled data. However, the former is overfitting on the training dataAdaBoost with undersampled data is the better model and the selected Adaboost model# to check performance of the model on the test data
adb_undersampled_model_test_perf = model_performance_classification_sklearn(
adb2, X_test, y_test
)
adb_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.88 | 0.86 | 0.31 | 0.45 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, adb2.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(adb2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V18 0.08 V3 0.06 V37 0.06 V26 0.06 V30 0.06 V9 0.06 V2 0.06 V21 0.04 V23 0.04 V17 0.04 V14 0.04 V34 0.04 V12 0.04 V36 0.04 V24 0.02 V33 0.02 V32 0.02 V28 0.02 V25 0.02 V15 0.02 V20 0.02 V35 0.02 V13 0.02 V11 0.02 V10 0.02 V7 0.02 V38 0.02 V39 0.02 V31 0.00 V1 0.00 V29 0.00 V27 0.00 V22 0.00 V19 0.00 V16 0.00 V8 0.00 V6 0.00 V5 0.00 V4 0.00 V40 0.00
feature_names = X_train.columns
importances = adb2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", GradientBoostingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.7028169014084507 Validation Performance: dtree: 0.7640449438202247
# to check performance of the model on training data
grb_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
grb_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.83 | 0.99 | 0.90 |
# to check performance of the model on validation data
grb_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
grb_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.76 | 0.89 | 0.82 |
grb1 = GradientBoostingClassifier(random_state=1)
# training the gradient boost model with oversampled training set
grb1.fit(X_train_over, y_train_over)
GradientBoostingClassifier(random_state=1)
# to check performance of the model training data
grb_oversampled_model_train_perf = model_performance_classification_sklearn(
grb1, X_train, y_train
)
grb_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.91 | 0.68 | 0.78 |
# to check performance of the model on the validation data
grb_oversampled_model_val_perf = model_performance_classification_sklearn(
grb1, X_val, y_val
)
grb_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.90 | 0.69 | 0.78 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, grb1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, grb1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
grb2 = GradientBoostingClassifier(random_state=1)
# training the gradient boost model with undersampled training set
grb2.fit(X_train_un, y_train_un)
GradientBoostingClassifier(random_state=1)
# to check performance of the model training data
grb_undersampled_model_train_perf = model_performance_classification_sklearn(
grb2, X_train, y_train
)
grb_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.95 | 0.53 | 0.68 |
# to check performance of the model on the validation data
grb_undersampled_model_val_perf = model_performance_classification_sklearn(
grb2, X_val, y_val
)
grb_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.90 | 0.52 | 0.66 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_over, grb2.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, grb2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
For Gradient Boosting: param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,25],
'learning_rate': [0.2, 0.05,1],
'subsample' : [0.5,0.7],
'max_features' : [0.5,0.7],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 150, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.7436619718309859:
# Set the clf to the best combination of parameters
grb1_tuned = GradientBoostingClassifier(
n_estimators=150,
subsample=0.7,
max_features=0.5,
learning_rate=0.2,
)
# Fit the best algorithm to the data.
grb1_tuned.fit(X_train, y_train)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=150, subsample=0.7)
# to check performance of the model on the training data
grb_tuned_default_model_train_perf = model_performance_classification_sklearn(
grb1_tuned, X_train, y_train
)
grb_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 0.95 | 0.99 | 0.97 |
# to check performance of the model on the validation data
grb_tuned_default_model_val_perf = model_performance_classification_sklearn(
grb1_tuned, X_val, y_val
)
grb_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.80 | 0.87 | 0.83 |
The tuned default model is overfitting
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,25],
'learning_rate': [0.2, 0.05,1],
'subsample' : [0.5,0.7],
'max_features' : [0.5,0.7],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 1} with CV score=0.9640198511166254:
# Set the clf to the best combination of parameters
grb2_tuned = GradientBoostingClassifier(
n_estimators=100,
subsample=0.5,
max_features=0.7,
learning_rate=1,
)
# Fit the best algorithm to the data.
grb2_tuned.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=1, max_features=0.7, subsample=0.5)
# to check performance of the model on the training data
grb_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
grb2_tuned, X_train, y_train
)
grb_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.98 | 0.78 | 0.87 |
# to check performance of the model on the validation data
grb_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
grb2_tuned, X_val, y_val
)
grb_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.87 | 0.58 | 0.70 |
The tuned oversampled model is also overfitting
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [100,150,25],
'learning_rate': [0.2, 0.05,1],
'subsample' : [0.5,0.7],
'max_features' : [0.5,0.7],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 150, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.8859154929577466:
# Set the clf to the best combination of parameters
grb3_tuned = GradientBoostingClassifier(
n_estimators=150,
subsample=0.7,
max_features=0.5,
learning_rate=0.2,
)
# Fit the best algorithm to the data.
grb3_tuned.fit(X_train_un, y_train_un)
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=150, subsample=0.7)
# to check performance of the model on the training data
grb_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
grb3_tuned, X_train, y_train
)
grb_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 1.00 | 0.55 | 0.71 |
# to check performance of the model on the validation data
grb_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
grb3_tuned, X_val, y_val
)
grb_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.91 | 0.51 | 0.65 |
# training performance comparison
models_train_comp_df = pd.concat(
[
grb_default_model_train_perf.T,
grb_oversampled_model_train_perf.T,
grb_undersampled_model_train_perf.T,
grb_tuned_default_model_train_perf.T,
grb_tuned_oversampled_model_train_perf.T,
grb_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default Gradient Boost",
"Gradient Boost with oversampled data",
"Gradient Boost with undersampled data",
"Tuned Default Gradient Boost",
"Tuned Gradient Boost with oversampled data",
"Tuned Gradient Boost with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default Gradient Boost | Gradient Boost with oversampled data | Gradient Boost with undersampled data | Tuned Default Gradient Boost | Tuned Gradient Boost with oversampled data | Tuned Gradient Boost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.99 | 0.97 | 0.95 | 1.00 | 0.98 | 0.95 |
| Recall | 0.83 | 0.91 | 0.95 | 0.95 | 0.98 | 1.00 |
| Precision | 0.99 | 0.68 | 0.53 | 0.99 | 0.78 | 0.55 |
| F1 | 0.90 | 0.78 | 0.68 | 0.97 | 0.87 | 0.71 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
grb_default_model_val_perf.T,
grb_oversampled_model_val_perf.T,
grb_undersampled_model_val_perf.T,
grb_tuned_default_model_val_perf.T,
grb_tuned_oversampled_model_val_perf.T,
grb_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default Gradient Boost",
"Gradient Boost with oversampled data",
"Gradient Boost with undersampled data",
"Tuned Default Gradient Boost",
"Tuned Gradient Boost with oversampled data",
"Tuned Gradient Boost with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default Gradient Boost | Gradient Boost with oversampled data | Gradient Boost with undersampled data | Tuned Default Gradient Boost | Tuned Gradient Boost with oversampled data | Tuned Gradient Boost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.98 | 0.97 | 0.95 | 0.98 | 0.96 | 0.95 |
| Recall | 0.76 | 0.90 | 0.90 | 0.80 | 0.87 | 0.91 |
| Precision | 0.89 | 0.69 | 0.52 | 0.87 | 0.58 | 0.51 |
| F1 | 0.82 | 0.78 | 0.66 | 0.83 | 0.70 | 0.65 |
The best performing models on the validation data are Untuned Gradient Boost with overersampled data and Untuned Gradient Boost with undersampled data
Untuned Gradient Boost with overersampled data model is the selected model because it generalizes better
# to check performance of the model on the test data
grb_oversampled_model_test_perf = model_performance_classification_sklearn(
grb1, X_test, y_test
)
grb_oversampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.86 | 0.63 | 0.73 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, grb1.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(grb1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V36 0.26 V18 0.17 V14 0.09 V39 0.09 V26 0.06 V16 0.05 V9 0.04 V3 0.03 V15 0.02 V12 0.02 V35 0.02 V7 0.01 V10 0.01 V37 0.01 V1 0.01 V34 0.01 V38 0.01 V21 0.01 V27 0.01 V30 0.01 V11 0.01 V33 0.01 V5 0.01 V6 0.00 V13 0.00 V32 0.00 V4 0.00 V17 0.00 V24 0.00 V40 0.00 V2 0.00 V20 0.00 V28 0.00 V22 0.00 V29 0.00 V8 0.00 V31 0.00 V23 0.00 V19 0.00 V25 0.00
feature_names = X_train.columns
importances = grb1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", XGBClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.719718309859155 Validation Performance: dtree: 0.7752808988764045
# to check performance of the model on training data
xgb_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
xgb_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.81 | 0.99 | 0.89 |
# to check performance of the model on validation data
xgb_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
xgb_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.78 | 0.98 | 0.87 |
xgb1 = XGBClassifier(random_state=1)
# training the gradient boost model with oversampled training set
xgb1.fit(X_train_over, y_train_over)
XGBClassifier(random_state=1)
# to check performance of the model training data
xgb_oversampled_model_train_perf = model_performance_classification_sklearn(
xgb1, X_train, y_train
)
xgb_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.90 | 0.69 | 0.79 |
# to check performance of the model on the validation data
xgb_oversampled_model_val_perf = model_performance_classification_sklearn(
xgb1, X_val, y_val
)
xgb_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.89 | 0.70 | 0.78 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, xgb1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, xgb1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
xgb2 = XGBClassifier(random_state=1)
# training the gradient boost model with undersampled training set
xgb2.fit(X_train_un, y_train_un)
XGBClassifier(random_state=1)
# to check performance of the model training data
xgb_undersampled_model_train_perf = model_performance_classification_sklearn(
xgb2, X_train, y_train
)
xgb_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.93 | 0.54 | 0.68 |
# to check performance of the model on the validation data
xgb_undersampled_model_val_perf = model_performance_classification_sklearn(
xgb2, X_val, y_val
)
xgb_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.90 | 0.59 | 0.71 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_over, xgb2.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, xgb2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
For XGBoost: param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [150,200,250],
'learning_rate': [0.1, 0.2],
'subsample' : [0.8,0.9],
'scale_pos_weight' : [5,10],
'gamma' : [0,3,5],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.8507042253521127:
# Set the clf to the best combination of parameters
xgb1_tuned = XGBClassifier(
n_estimators=200,
subsample=0.9,
scale_pos_weight=10,
learning_rate=0.1,
gamma=5,
)
# Fit the best algorithm to the data.
xgb1_tuned.fit(X_train, y_train)
XGBClassifier(gamma=5, n_estimators=200, scale_pos_weight=10, subsample=0.9)
# to check performance of the model on the training data
xgb_tuned_default_model_train_perf = model_performance_classification_sklearn(
xgb1_tuned, X_train, y_train
)
xgb_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.97 | 0.91 | 0.93 |
# to check performance of the model on the validation data
xgb_tuned_default_model_val_perf = model_performance_classification_sklearn(
grb1_tuned, X_val, y_val
)
xgb_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.80 | 0.87 | 0.83 |
The tuned default model is overfitting
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [150,200,250],
'learning_rate': [0.1, 0.2],
'subsample' : [0.8,0.9],
'scale_pos_weight' : [5,10],
'gamma' : [0,3,5],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.2, 'gamma': 0} with CV score=0.9941273779983458:
# Set the clf to the best combination of parameters
xgb2_tuned = XGBClassifier(
n_estimators=200,
subsample=0.9,
scale_pos_weight=10,
learning_rate=0.2,
gamma=0,
)
# Fit the best algorithm to the data.
xgb2_tuned.fit(X_train_over, y_train_over)
XGBClassifier(learning_rate=0.2, n_estimators=200, scale_pos_weight=10,
subsample=0.9)
# to check performance of the model on the training data
xgb_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
xgb2_tuned, X_train, y_train
)
xgb_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.94 | 1.00 | 0.50 | 0.66 |
# to check performance of the model on the validation data
xgb_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
xgb2_tuned, X_val, y_val
)
xgb_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93 | 0.92 | 0.42 | 0.58 |
# defining model
Model = XGBClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [150,200,250],
'learning_rate': [0.1, 0.2],
'subsample' : [0.8,0.9],
'scale_pos_weight' : [5,10],
'gamma' : [0,3,5],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9225352112676056:
# Set the clf to the best combination of parameters
xgb3_tuned = XGBClassifier(
n_estimators=200,
subsample=0.9,
scale_pos_weight=10,
learning_rate=0.1,
gamma=5,
)
# Fit the best algorithm to the data.
xgb3_tuned.fit(X_train_un, y_train_un)
XGBClassifier(gamma=5, n_estimators=200, scale_pos_weight=10, subsample=0.9)
# to check performance of the model on the training data
xgb_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
xgb3_tuned, X_train, y_train
)
xgb_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.84 | 1.00 | 0.26 | 0.41 |
# to check performance of the model on the validation data
xgb_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
xgb3_tuned, X_val, y_val
)
xgb_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.83 | 0.94 | 0.24 | 0.39 |
# training performance comparison
models_train_comp_df = pd.concat(
[
xgb_default_model_train_perf.T,
xgb_oversampled_model_train_perf.T,
xgb_undersampled_model_train_perf.T,
xgb_tuned_default_model_train_perf.T,
xgb_tuned_oversampled_model_train_perf.T,
xgb_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default XGBoost",
"XGBoost with oversampled data",
"XGBoost with undersampled data",
"Tuned Default XGBoost",
"Tuned XGBoost with oversampled data",
"Tuned XGBoost with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default XGBoost | XGBoost with oversampled data | XGBoost with undersampled data | Tuned Default XGBoost | Tuned XGBoost with oversampled data | Tuned XGBoost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.99 | 0.97 | 0.95 | 0.99 | 0.94 | 0.84 |
| Recall | 0.81 | 0.90 | 0.93 | 0.97 | 1.00 | 1.00 |
| Precision | 0.99 | 0.69 | 0.54 | 0.91 | 0.50 | 0.26 |
| F1 | 0.89 | 0.79 | 0.68 | 0.93 | 0.66 | 0.41 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
xgb_default_model_val_perf.T,
xgb_oversampled_model_val_perf.T,
xgb_undersampled_model_val_perf.T,
xgb_tuned_default_model_val_perf.T,
xgb_tuned_oversampled_model_val_perf.T,
xgb_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default XGBoost",
"XGBoost with oversampled data",
"XGBoost with undersampled data",
"Tuned Default XGBoost",
"Tuned XGBoost with oversampled data",
"Tuned XGBoost with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default XGBoost | XGBoost with oversampled data | XGBoost with undersampled data | Tuned Default XGBoost | Tuned XGBoost with oversampled data | Tuned XGBoost with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.99 | 0.97 | 0.96 | 0.98 | 0.93 | 0.83 |
| Recall | 0.78 | 0.89 | 0.90 | 0.80 | 0.92 | 0.94 |
| Precision | 0.98 | 0.70 | 0.59 | 0.87 | 0.42 | 0.24 |
| F1 | 0.87 | 0.78 | 0.71 | 0.83 | 0.58 | 0.39 |
The best performing models on the validation data are Untuned XGBoost model with undersampled data and Tuned XGBoost model with undersampled data
Untuned XGBoost model with undersampled data model is the selected model because it generalizes better
# to check performance of the model on the test data
xgb_undersampled_model_test_perf = model_performance_classification_sklearn(
xgb2, X_test, y_test
)
xgb_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.94 | 0.87 | 0.49 | 0.63 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, xgb2.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(xgb2.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V18 0.14 V36 0.10 V39 0.09 V14 0.06 V3 0.04 V26 0.04 V40 0.04 V11 0.03 V16 0.03 V12 0.02 V35 0.02 V15 0.02 V1 0.02 V27 0.02 V8 0.02 V37 0.02 V25 0.02 V38 0.02 V20 0.02 V9 0.02 V33 0.01 V13 0.01 V10 0.01 V5 0.01 V30 0.01 V29 0.01 V19 0.01 V34 0.01 V21 0.01 V24 0.01 V31 0.01 V7 0.01 V4 0.01 V28 0.01 V6 0.01 V22 0.01 V2 0.01 V23 0.01 V17 0.01 V32 0.01
feature_names = X_train.columns
importances = xgb2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("dtree", BaggingClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset: dtree: 0.7014084507042254 Validation Performance: dtree: 0.6966292134831461
# to check performance of the model on training data
bgc_default_model_train_perf = model_performance_classification_sklearn(
model, X_train, y_train
)
bgc_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 0.96 | 1.00 | 0.98 |
# to check performance of the model on validation data
bgc_default_model_val_perf = model_performance_classification_sklearn(
model, X_val, y_val
)
bgc_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.70 | 0.97 | 0.81 |
The default model is overfitting
bgc1 = BaggingClassifier(random_state=1)
# training the gradient boost model with oversampled training set
bgc1.fit(X_train_over, y_train_over)
BaggingClassifier(random_state=1)
# to check performance of the model training data
bgc_oversampled_model_train_perf = model_performance_classification_sklearn(
bgc1, X_train, y_train
)
bgc_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 0.99 | 1.00 |
# to check performance of the model on the validation data
bgc_oversampled_model_val_perf = model_performance_classification_sklearn(
bgc1, X_val, y_val
)
bgc_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.81 | 0.84 | 0.83 |
# Confusion matrix for oversampled train data
cm = confusion_matrix(y_train_over, bgc1.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, bgc1.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
The oversampled model is also overfitting
bgc2 = BaggingClassifier(random_state=1)
# training the gradient boost model with undersampled training set
bgc2.fit(X_train_un, y_train_un)
BaggingClassifier(random_state=1)
# to check performance of the model training data
bgc_undersampled_model_train_perf = model_performance_classification_sklearn(
bgc2, X_train, y_train
)
bgc_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.98 | 0.52 | 0.68 |
# to check performance of the model on the validation data
bgc_undersampled_model_val_perf = model_performance_classification_sklearn(
bgc2, X_val, y_val
)
bgc_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.94 | 0.87 | 0.48 | 0.62 |
# Confusion matrix for undersampled train data
cm = confusion_matrix(y_train_over, bgc2.predict(X_train_over))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# Confusion matrix for validation data
cm = confusion_matrix(y_val, bgc2.predict(X_val))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
For Bagging Classifier: param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.7295774647887323:
# Set the clf to the best combination of parameters
bgc1_tuned = XGBClassifier(
n_estimators=70,
max_features=0.9,
max_samples=0.9,
)
# Fit the best algorithm to the data.
bgc1_tuned.fit(X_train, y_train)
XGBClassifier(max_features=0.9, max_samples=0.9, n_estimators=70)
# to check performance of the model on the training data
bgc_tuned_default_model_train_perf = model_performance_classification_sklearn(
bgc1_tuned, X_train, y_train
)
bgc_tuned_default_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99 | 0.77 | 0.99 | 0.87 |
# to check performance of the model on the validation data
bgc_tuned_default_model_val_perf = model_performance_classification_sklearn(
bgc1_tuned, X_val, y_val
)
bgc_tuned_default_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98 | 0.74 | 0.98 | 0.84 |
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.8} with CV score=0.9825475599669147:
# Set the clf to the best combination of parameters
bgc2_tuned = XGBClassifier(
n_estimators=70,
max_features=0.8,
max_samples=0.9,
)
# Fit the best algorithm to the data.
bgc2_tuned.fit(X_train_over, y_train_over)
XGBClassifier(max_features=0.8, max_samples=0.9, n_estimators=70)
# to check performance of the model on the training data
bgc_tuned_oversampled_model_train_perf = model_performance_classification_sklearn(
bgc2_tuned, X_train, y_train
)
bgc_tuned_oversampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.90 | 0.63 | 0.74 |
# to check performance of the model on the validation data
bgc_tuned_oversampled_model_val_perf = model_performance_classification_sklearn(
bgc2_tuned, X_val, y_val
)
bgc_tuned_oversampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.97 | 0.88 | 0.67 | 0.76 |
# defining model
Model = BaggingClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=kfold, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.8704225352112676:
# Set the clf to the best combination of parameters
bgc3_tuned = XGBClassifier(
n_estimators=70,
max_features=0.9,
max_samples=0.9,
)
# Fit the best algorithm to the data.
bgc3_tuned.fit(X_train_un, y_train_un)
XGBClassifier(max_features=0.9, max_samples=0.9, n_estimators=70)
# to check performance of the model on the training data
bgc_tuned_undersampled_model_train_perf = model_performance_classification_sklearn(
bgc3_tuned, X_train, y_train
)
bgc_tuned_undersampled_model_train_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.91 | 0.50 | 0.65 |
# to check performance of the model on the validation data
bgc_tuned_undersampled_model_val_perf = model_performance_classification_sklearn(
bgc3_tuned, X_val, y_val
)
bgc_tuned_undersampled_model_val_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95 | 0.89 | 0.54 | 0.67 |
# training performance comparison
models_train_comp_df = pd.concat(
[
bgc_default_model_train_perf.T,
bgc_oversampled_model_train_perf.T,
bgc_undersampled_model_train_perf.T,
bgc_tuned_default_model_train_perf.T,
bgc_tuned_oversampled_model_train_perf.T,
bgc_tuned_undersampled_model_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Default Bagging Classifier",
"Bagging Classifier with oversampled data",
"Bagging Classifier with undersampled data",
"Tuned Default Bagging Classifier",
"Tuned Bagging Classifier with oversampled data",
"Tuned Bagging Classifier with undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Default Bagging Classifier | Bagging Classifier with oversampled data | Bagging Classifier with undersampled data | Tuned Default Bagging Classifier | Tuned Bagging Classifier with oversampled data | Tuned Bagging Classifier with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 1.00 | 1.00 | 0.95 | 0.99 | 0.96 | 0.95 |
| Recall | 0.96 | 1.00 | 0.98 | 0.77 | 0.90 | 0.91 |
| Precision | 1.00 | 0.99 | 0.52 | 0.99 | 0.63 | 0.50 |
| F1 | 0.98 | 1.00 | 0.68 | 0.87 | 0.74 | 0.65 |
# Validation performance comparison
models_val_comp_df = pd.concat(
[
bgc_default_model_val_perf.T,
bgc_oversampled_model_val_perf.T,
bgc_undersampled_model_val_perf.T,
bgc_tuned_default_model_val_perf.T,
bgc_tuned_oversampled_model_val_perf.T,
bgc_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Default Bagging Classifier",
"Bagging Classifier with oversampled data",
"Bagging Classifier with undersampled data",
"Tuned Default Bagging Classifier",
"Tuned Bagging Classifier with oversampled data",
"Tuned Bagging Classifier with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Default Bagging Classifier | Bagging Classifier with oversampled data | Bagging Classifier with undersampled data | Tuned Default Bagging Classifier | Tuned Bagging Classifier with oversampled data | Tuned Bagging Classifier with undersampled data | |
|---|---|---|---|---|---|---|
| Accuracy | 0.98 | 0.98 | 0.94 | 0.98 | 0.97 | 0.95 |
| Recall | 0.70 | 0.81 | 0.87 | 0.74 | 0.88 | 0.89 |
| Precision | 0.97 | 0.84 | 0.48 | 0.98 | 0.67 | 0.54 |
| F1 | 0.81 | 0.83 | 0.62 | 0.84 | 0.76 | 0.67 |
The best performing models on the validation data are Untuned Bagging Classifier model with undersampled data and Tuned Bagging Classifier model with undersampled data
Tuned Bagging Classifier model with undersampled data model is the selected model because it generalizes better and has a better recall performance
# to check performance of the model on the test data
bgc_tuned_undersampled_model_test_perf = model_performance_classification_sklearn(
bgc3_tuned, X_test, y_test
)
bgc_tuned_undersampled_model_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.94 | 0.86 | 0.46 | 0.60 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y_test, bgc3_tuned.predict(X_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
# importance of features in the tree building
print(pd.DataFrame(bgc3_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
Imp V18 0.15 V36 0.09 V39 0.09 V14 0.05 V3 0.04 V26 0.04 V15 0.03 V40 0.03 V31 0.03 V16 0.03 V11 0.03 V25 0.03 V12 0.03 V35 0.02 V5 0.02 V20 0.02 V1 0.02 V13 0.02 V9 0.01 V34 0.01 V27 0.01 V8 0.01 V37 0.01 V10 0.01 V38 0.01 V33 0.01 V23 0.01 V24 0.01 V30 0.01 V21 0.01 V4 0.01 V29 0.01 V19 0.01 V7 0.01 V28 0.01 V6 0.01 V2 0.01 V22 0.01 V32 0.01 V17 0.01
feature_names = X_train.columns
importances = bgc3_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Validation performance comparison
models_val_comp_df = pd.concat(
[
dtree_undersampled_model_val_perf.T,
rf3_tuned_undersampled_model_val_perf.T,
lgr_undersampled_model_val_perf.T,
adb_undersampled_model_val_perf.T,
grb_oversampled_model_val_perf.T,
xgb_undersampled_model_val_perf.T,
bgc_tuned_undersampled_model_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Decision Tree with undersampled data",
"Tuned Random Forest with undersampled data",
"Logistic Regression with undersampled data",
"AdaBoost with undersampled data",
"Gradient Boost with oversampled data",
"XGBoost with undersampled data",
"Tuned Bagging Classifier with undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| Decision Tree with undersampled data | Tuned Random Forest with undersampled data | Logistic Regression with undersampled data | AdaBoost with undersampled data | Gradient Boost with oversampled data | XGBoost with undersampled data | Tuned Bagging Classifier with undersampled data | |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.92 | 0.96 | 0.87 | 0.89 | 0.97 | 0.96 | 0.95 |
| Recall | 0.85 | 0.92 | 0.89 | 0.89 | 0.90 | 0.90 | 0.89 |
| Precision | 0.39 | 0.59 | 0.29 | 0.33 | 0.69 | 0.59 | 0.54 |
| F1 | 0.53 | 0.71 | 0.44 | 0.48 | 0.78 | 0.71 | 0.67 |
# Test performance comparison
models_test_comp_df = pd.concat(
[
dtree_undersampled_model_test_perf.T,
rf3_tuned_undersampled_model_test_perf.T,
lgr_undersampled_model_test_perf.T,
adb_undersampled_model_test_perf.T,
grb_oversampled_model_test_perf.T,
xgb_undersampled_model_test_perf.T,
bgc_tuned_undersampled_model_test_perf.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree with undersampled data",
"Tuned Random Forest with undersampled data",
"Logistic Regression with undersampled data",
"AdaBoost with undersampled data",
"Gradient Boost with oversampled data",
"XGBoost with undersampled data",
"Tuned Bagging Classifier with undersampled data",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| Decision Tree with undersampled data | Tuned Random Forest with undersampled data | Logistic Regression with undersampled data | AdaBoost with undersampled data | Gradient Boost with oversampled data | XGBoost with undersampled data | Tuned Bagging Classifier with undersampled data | |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.91 | 0.95 | 0.86 | 0.88 | 0.96 | 0.94 | 0.94 |
| Recall | 0.84 | 0.87 | 0.86 | 0.86 | 0.86 | 0.87 | 0.86 |
| Precision | 0.35 | 0.52 | 0.26 | 0.31 | 0.63 | 0.49 | 0.46 |
| F1 | 0.50 | 0.65 | 0.41 | 0.45 | 0.73 | 0.63 | 0.60 |
Gradient Boost with oversampled data because it has the best combination of recall, accuracy, precision and F1 scoretest_data.head(3)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.61 | -3.82 | 2.20 | 1.30 | -1.18 | -4.50 | -1.84 | 4.72 | 1.21 | -0.34 | -5.12 | 1.02 | 4.82 | 3.27 | -2.98 | 1.39 | 2.03 | -0.51 | -1.02 | 7.34 | -2.24 | 0.16 | 2.05 | -2.77 | 1.85 | -1.79 | -0.28 | -1.26 | -3.83 | -1.50 | 1.59 | 2.29 | -5.41 | 0.87 | 0.57 | 4.16 | 1.43 | -10.51 | 0.45 | -1.45 | 0 |
| 1 | 0.39 | -0.51 | 0.53 | -2.58 | -1.02 | 2.24 | -0.44 | -4.41 | -0.33 | 1.97 | 1.80 | 0.41 | 0.64 | -1.39 | -1.88 | -5.02 | -3.83 | 2.42 | 1.76 | -3.24 | -3.19 | 1.86 | -1.71 | 0.63 | -0.59 | 0.08 | 3.01 | -0.18 | 0.22 | 0.87 | -1.78 | -2.47 | 2.49 | 0.32 | 2.06 | 0.68 | -0.49 | 5.13 | 1.72 | -1.49 | 0 |
| 2 | -0.87 | -0.64 | 4.08 | -1.59 | 0.53 | -1.96 | -0.70 | 1.35 | -1.73 | 0.47 | -4.93 | 3.57 | -0.45 | -0.66 | -0.17 | -1.63 | 2.29 | 2.40 | 0.60 | 1.79 | -2.12 | 0.48 | -0.84 | 1.79 | 1.87 | 0.36 | -0.17 | -0.48 | -2.12 | -2.16 | 2.91 | -1.32 | -3.00 | 0.46 | 0.62 | 5.63 | 1.32 | -1.75 | 1.81 | 1.68 | 0 |
final_test = test_data.copy()
# separating the independent and dependent variables
X1_test = final_test.drop(["Target"], axis=1)
y1_test = final_test["Target"]
final_test.isnull().sum()
V1 5 V2 6 V3 0 V4 0 V5 0 V6 0 V7 0 V8 0 V9 0 V10 0 V11 0 V12 0 V13 0 V14 0 V15 0 V16 0 V17 0 V18 0 V19 0 V20 0 V21 0 V22 0 V23 0 V24 0 V25 0 V26 0 V27 0 V28 0 V29 0 V30 0 V31 0 V32 0 V33 0 V34 0 V35 0 V36 0 V37 0 V38 0 V39 0 V40 0 Target 0 dtype: int64
# Let's impute the missing values
imp_median = KNNImputer(n_neighbors=5)
# fit the imputer on train data and transform the train data
X1_test["V1"] = imp_median.fit_transform(X1_test[["V1"]])
X1_test["V2"] = imp_median.fit_transform(X1_test[["V2"]])
# final test performance of the selected model on the final test data
grb_oversampled_model_final_test_perf = model_performance_classification_sklearn(
grb1, X1_test, y1_test
)
grb_oversampled_model_final_test_perf
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.85 | 0.63 | 0.72 |
# Confusion matrix for the selected model on test data
cm = confusion_matrix(y1_test, grb1.predict(X1_test))
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
Text(42.0, 0.5, 'Actual Values')
print(pd.DataFrame(grb1.feature_importances_, columns = ["Imp"], index = X1_test.columns).sort_values(by = 'Imp', ascending = False))
Imp V36 0.26 V18 0.17 V14 0.09 V39 0.09 V26 0.06 V16 0.05 V9 0.04 V3 0.03 V15 0.02 V12 0.02 V35 0.02 V7 0.01 V10 0.01 V37 0.01 V1 0.01 V34 0.01 V38 0.01 V21 0.01 V27 0.01 V30 0.01 V11 0.01 V33 0.01 V5 0.01 V6 0.00 V13 0.00 V32 0.00 V4 0.00 V17 0.00 V24 0.00 V40 0.00 V2 0.00 V20 0.00 V28 0.00 V22 0.00 V29 0.00 V8 0.00 V31 0.00 V23 0.00 V19 0.00 V25 0.00
feature_names = X1_test.columns
importances = grb1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Based on the final test data and training data
df.columns
Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40',
'Target'],
dtype='object')
# creating a list of numerical variables
numerical_features = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31',
'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40',
]
# creating a transformer for numerical variables, which will apply KNN imputer on the numerical variables
numeric_transformer = Pipeline(
steps=[
("imputer", KNNImputer(n_neighbors=5)),
]
)
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features)],
remainder="passthrough",
)
# splitting the test and train data
train_X = df.drop(["Target"], axis=1)
train_y = df['Target']
test_X = test_data.drop(["Target"], axis=1)
test_y = test_data['Target']
# Let's impute the missing values
imp = KNNImputer(n_neighbors=5)
# fit the imputer on train data and transform the train data
train_X[['V1','V2']] = imp.fit_transform(train_X[['V1','V2']])
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
train_X_over, train_y_over = sm.fit_resample(train_X, train_y)
pipe = make_pipeline(preprocessor, GradientBoostingClassifier())
pipe.fit(train_X_over, train_y_over)
Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
KNNImputer())]),
['V1', 'V2', 'V3', 'V4', 'V5',
'V6', 'V7', 'V8', 'V9',
'V10', 'V11', 'V12', 'V13',
'V14', 'V15', 'V16', 'V17',
'V18', 'V19', 'V20', 'V21',
'V22', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29',
'V30', ...])])),
('gradientboostingclassifier', GradientBoostingClassifier())])
Model_test = model_performance_classification_sklearn(pipe, test_X, test_y)
Model_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96 | 0.85 | 0.63 | 0.73 |